diff --git a/_images/encoder_decoder.png b/_images/encoder_decoder.png new file mode 100644 index 0000000..163d337 Binary files /dev/null and b/_images/encoder_decoder.png differ diff --git a/_images/muchomusic.png b/_images/muchomusic.png new file mode 100644 index 0000000..ea107eb Binary files /dev/null and b/_images/muchomusic.png differ diff --git a/_sources/description/datasets.ipynb b/_sources/description/datasets.ipynb index 7218604..23b2055 100644 --- a/_sources/description/datasets.ipynb +++ b/_sources/description/datasets.ipynb @@ -9,6 +9,8 @@ "\n", "Below we review existing music datasets that contain natural language text accompanying music items. These can be music descriptions such as captions, or other types of text such as question-answer pairs and reviews. Note that in many cases the audio isn't directly distributed with the dataset and may be subject to copyright. \n", "\n", + "```{table} Music description datasets.\n", + ":name: description_datasets\n", "| Dataset | Content | Size (# annotations) | Accompanying Audio | Audio Length | Audio License| Text source | Dataset License\n", "| ------- | ------ | ---- | ---- | ---- | ---- | ---- | ---- | \n", "| [MusicCaps](https://www.kaggle.com/datasets/googleai/musiccaps) | Captions | 5.5k | ❌
(YT IDs from AudioSet)| 10s | - | Human-written (by musicians) | CC BY-SA 4.0 |\n", @@ -22,7 +24,8 @@ "|[MUCaps](https://huggingface.co/datasets/M2UGen/MUCaps) | Captions | 22k | ❌ (YT IDs from AudioSet) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0|\n", "|[MuEdit](https://huggingface.co/datasets/M2UGen/MUEdit) | Music editing instructions | 11k | ❌
(MusicCaps) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0|\n", "|[FUTGA](https://huggingface.co/datasets/JoshuaW1997/FUTGA) | Captions (fine-grained) | 51.8k | ❌
(MusicCaps, Song Describer Dataset) | 2-5min | - | Synthetic (generated from audio via FUTGA) | Apache-2.0 |\n", - "|[MARD](https://www.upf.edu/web/mtg/mard) | Album reviews | 264k| ❌ | - | - | Human-written (Amazon customers) | MIT |" + "|[MARD](https://www.upf.edu/web/mtg/mard) | Album reviews | 264k| ❌ | - | - | Human-written (Amazon customers) | MIT |\n", + "```" ] }, { diff --git a/_sources/description/evaluation.md b/_sources/description/evaluation.md index 569bf8e..219caa7 100644 --- a/_sources/description/evaluation.md +++ b/_sources/description/evaluation.md @@ -1,6 +1,5 @@ (description_evaluation)= # Evaluation - Reliably evaluating music description systems is a challenging endeavour. Even when we have "ground-truth" captions, it is not always clear how to score generated text, as music description is open-ended, and at least partially subjective. The quality of a description is also strongly dependent on the context in which it is used. This issue gets even more pronounced with more dialogue-based tasks like MQA or other forms of instruction-based description. Comparing outputs to gold standard from static datasets can help, but it's only the first step. @@ -25,11 +24,20 @@ We briefly review each of these metrics below: * **BERT-Score** also computes the similarity between tokens in a generated sentence and tokens in the ground-truth text, but does so using contextual embeddings obtained from a pre-trained BERT model instead of exact matches, resulting in a higher correlation with human judgements. -## Other types of automatic evaluation -* Multiple-choice question answering: MuChoMusic {cite}`weck_muchomusic_2024` -* Other benchmarks: OpenMU {cite}`zhao_openmu_2024` -* LLM-as-a-judge -* Non audio: {cite}`li_music_2024` +### Limitations +While a useful starting point in evaluating model outputs on more closed-ended tasks, these metrics are unable to capture all admissible variations in music description. 
For example, given a music track, there may be several possible captions that are equally valid but share very little in terms of syntactic or semantic similarity. Both in the music domain and in others such as general audio description, many studies have highlighted important limitations of these metrics, for example showing they fail to account for valid variations in captions and to align with human judgement {cite}`lee2024captioningmetricsreflectmusic`. For this reason, including human evaluation and task-specific benchmarks is necessary for a more well-rounded evaluation. + +## Benchmarks +To overcome some of the shortcomings of match-based metrics, a few benchmarks have recently emerged with the goal of assessing music understanding or description via multiple-choice question-answering. These also better suit the conversational format of more recent music description systems, as they focus on assessing responses to specific user prompts (questions). Some benchmarks of this kind are designed for general audio-language evaluation and include music as part of a wider range of domains. Among these are AudioBench {cite}`wang2024audiobench` and AIR-Bench {cite}`yang-etal-2024-air`. 
Others, including [MuChoMusic](https://mulab-mir.github.io/muchomusic/) {cite}`weck_muchomusic_2024` and [OpenMU](https://mzhaojp22.github.io/open_music_understanding/) {cite}`zhao_openmu_2024`, directly focus on music: + +```{figure} ./img/muchomusic.png +--- +name: muchomusic +width: 400px +align: center +--- + +``` ## References diff --git a/_sources/description/models.md b/_sources/description/models.md index 5602016..e3ba031 100644 --- a/_sources/description/models.md +++ b/_sources/description/models.md @@ -3,26 +3,30 @@ Deep learning models for music description via natural language typically fit into one of two designs: -- Encoder-decoder -- Multimodal Autoregressive Models +- [Encoder-decoder](encoder_decoder_models) models +- [Multimodal AR](multimodal_ar) models, most often in the form of [adapted LLMs](adapted_llms) -In {numref}`description_models_table` below we give an overview of music description models from 2016 to today. * denotes taks that don't fall under the music description umbrella but are still addressed by the model. +In {numref}`description_models_table` we give an overview of music description models from 2016 to today. * denotes tasks that don't fall under the music description umbrella but are still addressed by the model. ```{table} Music description models. 
:name: description_models_table | Model | Type | Task(s) | Weights | Training dataset | | ------- | ------ | ---- | ---- | ---- | -| Choi *et al.* {cite}`manco2021muscaps` | Encoder-decoder | Playlist captioning | ❌ | Private | -| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private | -| LP-MusicCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning | ✅ | | -| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private | -| BLAP {cite}`lanzendorfer_blap_2024` | Encoder-decoder | | | | -| MuLLama {cite}`liu_music_2024` | Adapted LLM | | | | -| MusiLingo {cite}`deng_musilingo_2024` | Adapted LLM | | | | -| M2UGen{cite}`hussain2023m` | Adapted LLM | | | | -| LLark {cite}`gardner2023llark` | Adapted LLM | | | | +| Choi *et al.* {cite}`manco2021muscaps` | Encoder-decoder | Captioning (playlist) | ❌ | Private data | +| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private data | +| PlayNTell {cite}`gabbolini-etal-2022-data` | Encoder-decoder | Captioning (playlist) | ✅ [link]() | PlayNTell | +| LP-MusicCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning | ✅ [link](https://huggingface.co/seungheondoh/lp-music-caps) | LP-MusicCaps | +| ALCAP {cite}`he2023alcap` | Encoder-decoder | Captioning | ❌ | Song Interpretation Dataset, NetEase Cloud Music Review Dataset | +| BLAP {cite}`lanzendorfer_blap_2024` | Adapted LLM | Captioning | ✅ [link](https://huggingface.co/Tino3141/blap/tree/main) | Shutterstock (31k clips) | +| LLark {cite}`gardner2023llark` | Adapted LLM | Captioning, MQA | ❌ | MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune| +| MU-LLaMA {cite}`liu_music_2024` | Adapted LLM | Captioning, MQA | ✅ [link](https://huggingface.co/mu-llama/MU-LLaMA/tree/main) | MusicQA | +| MusiLingo {cite}`deng_musilingo_2024` | Adapted LLM | Captioning, MQA | ✅ [link](https://github.com/zihaod/MusiLingo?tab=readme-ov-file#model-checkpoints) 
| MusicInstruct | +| M2UGen{cite}`hussain2023m` | Adapted LLM | Captioning, MQA, music generation | ✅ [link](https://huggingface.co/M2UGen) | MUCaps, MUEdit | +| OpenMU {cite}`zhao2024openmu` | Adapted LLM | Captioning, MQA | ✅ [link]() | MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune| +| FUTGA {cite}`wu2024futga` | Adapted LLM | Captioning (fine-grained) | ✅ [link](https://huggingface.co/JoshuaW1997/FUTGA) | FUTGA| ``` +(encoder_decoder_models)= ## Encoder-Decoder Models This is the modelling framework of the earliest DL music captioning models. Encoder-decoder models first emerged in the context of sequence-to-sequence tasks (e.g. machine translation). It is easy to see many tasks can be cast as sequence-to-sequence, so encoder-decoder models found wide use in image captioning first, and audio captioning shortly after, including music. @@ -89,7 +93,7 @@ where $\boldsymbol{w}_{a t t}$ and $\boldsymbol{W}^{a t t}$ are learnable parame Similar types of attention-based fusion can also be used in Transformer-based architectures {cite}`gabbolini-etal-2022-data` {cite}`doh2023lp`. In this setting, instead of the cross-attention shown above, fusion can also be directly embedded within the Transformer blocks by modifying their self-attention mechanism to depend on both text and audio embeddings, though exact implementations of co-attentional Transformer layers vary between models: $$ -\boldsymbol{A}\left(\boldsymbol{q}^{\text{text}}_{i}, \boldsymbol{K}^{\text{audio}}, \boldsymbol{V}^{\text{audio}}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{q}^{\text{text}}_{i} K^{\text{audio}}}{\sqrt{d_{k}}}\right) \boldsymbol{V}^{\text{audio}} +\boldsymbol{A}\left(\boldsymbol{q}^{\text{text}}_{i}, \boldsymbol{K}^{\text{audio}}, \boldsymbol{V}^{\text{audio}}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{q}^{\text{text}}_{i} K^{\text{audio}}}{\sqrt{d_{k}}}\right) \boldsymbol{V}^{\text{audio}}. 
$$ @@ -103,12 +107,13 @@ align: center In addition to the type of mechanism used, depending on the level at which modalities are combined, it is also common to distinguish between *early* (i.e. at the input level), *intermediate* (at the level of latent representations produced by an intermediate step in the overall processing pipeline) or *late* fusion (i.e. at the output level). We note that the terms *early, intermediate* and *late* fusion do not have an unequivocal definition and are used slightly differently in different works. +(multimodal_ar)= +## Multimodal AR Models +The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today's state-of-the-art models rely on LLMs in one form or another. Typically, this means that music description systems closely mimic text-only autoregressive modelling via Transformers, but within this framework there are two main routes we can take. The first, and most common, is to adapt text-only LLMs so that they become multimodal by augmenting them with additional modelling components. We call these *adapted LLMs*. A second option is to instead treat audio and text as sequences of tokens from the start, devising tokenization techniques and training on multiple modalities without additional modality-specific components. The line between these two approaches is not always clear. In the next section, we attempt to better define the salient characteristics of LLMs adapted to music-language inputs, and sketch out the newer trend towards natively multimodal models and its potential in music description. -## Multimodal Autoregressive Models -The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today's state-of-the-art models rely on LLMs in one form or another. 
Typically, this means that music description systems closely mimic text-only autoregressive modelling via Transformers, but within this framework there are two main routes we can take. The first, and most common, is to adapt text-only LLMs so that they become multimodal by augmenting them with additional modelling components. A second option is to instead treat audio and text as sequences of tokens from the start, devising tokenization techniques and training on multiple modalities without additional modality-specific components. The line between these two approaches is not always clear. In the next section, we attempt to better define the salient characteristics of LLMs adapted to music-language inputs, and sketch out the newer trend towards natively multimodal models and its potential in music description. - -Overall, a common thread in this line of work is the attempt to unify multimodal tasks by reframing all as text generation. When trained on music data, multimodal LLMs can therefore leverage their text-based interface to enable a variety of music understanding tasks by simply allowing users to query via text and obtain information about a given audio input. This is the machanism that enables the conversation-based music description tasks we have seen in the [Tasks](description_tasks) section. +Overall, a common thread in this line of work is the attempt to unify multimodal tasks by reframing all as text generation. When trained on music data, multimodal LLMs can therefore leverage their text-based interface to enable a variety of music understanding and description tasks by simply allowing users to query via text and obtain information about a given audio input. This is the mechanism that enables the conversation-based music description tasks we have seen in the [Tasks](description_tasks) section. 
+(adapted_llms)= ### Adapted LLMs One modelling paradigm that has become particularly popular in audio description, including music, is that of adapted (multimodal) LLMs. At the core of this approach is a pre-trained text-only LLM, which is adapted to take in inputs of different modalities such as audio. This is achieved via an *adapter* module, a light-weight neural network trained to map embeddings produced by an audio feature extractor (usually pre-trained and then frozen) to the input space of the LLM. As a result of this adaptation process, the LLM can then receive audio embeddings alongside text embeddings. @@ -121,31 +126,23 @@ align: center --- ``` -🚧 - -Alongside music-specialised multimodal LLMs, a LLM with general-audio understanding capabilities can similarly perform music description tasks such as captioning and MQA. Among these we count: -* SALMONN {cite}`tang_salmonn_2023` -* Pengi {cite}`deshmukh_pengi_2023` -* Qwen-Audio `chu_qwen-audio_2023` -* LTU -* [Audio-LLM: Activating the Capabilities of Large Language Models to Comprehend Audio Data](https://link.springer.com/chapter/10.1007/978-981-97-4399-5_13) - -We don't discuss these in detail, but their high-level design is similar to the music-specialised models we've seen in this section. - -#### Adapter Modules +The architecture of the adapter modules employed in adapted LLMs for music typically consists of lightweight MLPs (between 2 and 3 hidden layers) or Q-Formers. Other architectures utilised in general audio adapted LLMs (or similar models in the visual domain) also include more complex designs such as Gated XATTN dense layers. [This blog post](https://lilianweng.github.io/posts/2022-06-09-vlm/) about Visual Language Models reviews these in more detail. -#### Training From the perspective of training, similarly to the text-only setting, training adapted LLMs is usually broken into several stages. 
After pre-training and finetuning of the text-only part, the remaining components undergo a series of multimodal training stages, while the backbone LLM is either kept frozen or further finetuned. These steps are usually a mixture of multi-task pre-training and supervised finetuning, often including instruction tuning, all carried out on pairs of audio and text data. -##### Instruction Tuning +Alongside music-specialised multimodal LLMs such as those in {numref}`description_models_table`, LLMs with general-audio understanding capabilities can similarly perform music description tasks such as captioning and MQA. Among these we count: +* SALMONN {cite}`tang_salmonn_2023` +* Pengi {cite}`deshmukh_pengi_2023` +* Qwen-Audio {cite}`chu_qwen-audio_2023` +* LTU {cite}`gong2023listen` +* Audio Flamingo {cite}`kong2024audio_flamingo` +* Audio-LLM {cite}`zhang2024audio_llm` ### Natively Multimodal AR Models -Other autoregressive Transformer models for music description share a similar core modelling mechanism to adapted LLM. But one key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo this in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start. +Adapted LLMs make it possible to transform text-only LLMs into multimodal models relatively efficiently: based on the models discussed in this section, around 20-150k audio-text paired samples are required to perform the adaptation stage of training, while multimodal pre-training would require orders of magnitude more data. However, this also limits their performance and often results in a bias towards the language modality and poor audio and music understanding capabilities {cite}`weck_muchomusic_2024`. An alternative that promises to overcome this limitation is to instead adopt a natively multimodal approach to AR modelling. 
One key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo this in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start. This paradigm is sometimes referred to as mixed-modal early-fusion modelling. -It's worth noting that, at this time, this type of model is a promising direction for music description, rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modalling paradigm will be on music description in the years to come. Among current examples of this type of model that include music description we have: -* AnyGPT {cite}`doh2023lp` -* +It's worth noting that, at this time, this type of model is a promising direction for music description, rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models, such as AnyGPT {cite}`zhan-etal-2024-anygpt`, include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modelling paradigm will be on music description in the years to come. ## References diff --git a/_sources/description/tasks.md b/_sources/description/tasks.md index 1d5bd03..d06c7a5 100644 --- a/_sources/description/tasks.md +++ b/_sources/description/tasks.md @@ -87,7 +87,7 @@ align: center A key difference between dialogue-based description and one-off captioning is that, instead of an `audio --> text` mapping, we are now dealing with an `(audio, text) --> text` mapping. 
This is reflected in the different model designs typically considered for these tasks (see [Models](description_models)). Differently from simple MQA, in music dialogue generation, responses are expected to be based on the entire dialogue history instead of only considering the current input. -In terms of real-world applications, the advantages of dialogue-based description are clear: instead of being constrained to a one-shot caption or answer, it allows users to provide text inputs to further instruct the model on what kind of information should be included, or how the text output itself should be structured. In short, these tasks make for a much more flexible approach which better reflects real-world use. One drawback is that they are harder to evaluate (see [Evaluation](description-evaluation))! +In terms of real-world applications, the advantages of dialogue-based description are clear: instead of being constrained to a one-shot caption or answer, it allows users to provide text inputs to further instruct the model on what kind of information should be included, or how the text output itself should be structured. In short, these tasks make for a much more flexible approach which better reflects real-world use. One drawback is that they are harder to evaluate (see [Evaluation](description_evaluation))! ## References diff --git a/bibliography.html b/bibliography.html index 5a68f36..b8cf727 100644 --- a/bibliography.html +++ b/bibliography.html @@ -487,6 +487,10 @@

Bibliography[CFSC17]

Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. Convolutional recurrent neural networks for music classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 2392–2396. 2017. doi:10.1109/ICASSP.2017.7952585.

+
+[CXZ+23] +

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models. December 2023. arXiv:2311.07919 [cs, eess]. URL: http://arxiv.org/abs/2311.07919 (visited on 2024-02-26), doi:10.48550/arXiv.2311.07919.

+
[CHL+24]

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, and others. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.

@@ -611,6 +615,10 @@

Bibliography[GGDRe22]

Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. It's raw! audio generation with state-space models. In International Conference on Machine Learning, ICML, volume 162 of Proceedings of Machine Learning Research, 7616–7633. PMLR, 2022.

+
+[GLL+23] +

Yuan Gong, Hongyin Luo, Alexander H Liu, Leonid Karlinsky, and James Glass. Listen, think, and understand. arXiv preprint arXiv:2305.10790, 2023.

+
[GL83]

Daniel W. Griffin and Jae S. Lim. Signal estimation from modified short-time fourier transform. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 804–807. IEEE, 1983.

@@ -667,6 +675,10 @@

Bibliography[KCI+20]

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. Panns: large-scale pretrained audio neural networks for audio pattern recognition. IEEE ACM Trans. Audio Speech Lang. Process., 28:2880–2894, 2020.

+
+[KGB+24] +

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: a novel audio language model with few-shot learning and dialogue abilities. In ICML. 2024. URL: https://openreview.net/forum?id=WYi3WKZjYe.

+
[KWG+24]

Junghyun Koo, Gordon Wichern, Francois G Germain, Sameer Khurana, and Jonathan Le Roux. Smitin: self-monitored inference-time intervention for generative music transformers. arXiv preprint arXiv:2404.02252, 2024.

@@ -691,6 +703,10 @@

Bibliography[Lee10]

Jin Ha Lee. Analysis of user needs and information features in natural language queries seeking music information. Journal of the American Society for Information Science and Technology, 61(5):1025–1045, 2010.

+
+[LL24] +

Jinwoo Lee and Kyogu Lee. Do captioning metrics reflect music semantic alignment? In International Society for Music Information Retrieval (ISMIR) 2024, Late Breaking Demo (LBD). 2024. URL: https://arxiv.org/abs/2411.11692.

+
[LN17]

Jongpil Lee and Juhan Nam. Multi-level and multi-scale feature aggregation using pretrained convolutional neural networks for music auto-tagging. IEEE Signal Processing Letters, 24(8):1208–1212, 2017. doi:10.1109/LSP.2017.2713830.

@@ -703,9 +719,9 @@

Bibliography[LGW+23]

Mark Levy, Bruno Di Giorgi, Floris Weers, Angelos Katharopoulos, and Tom Nickson. Controllable music production with diffusion models and guidance gradients. In Diffusion Models Workshop at NeurIPS. 2023.

-
-[LYT+24] -

Jiajia Li, Lu Yang, Mingni Tang, Cong Chen, Zuchao Li, Ping Wang, and Hai Zhao. The music maestro or the musically challenged, a massive music evaluation benchmark for large language models. arXiv preprint arXiv:2406.15885, 2024.

+
+[LTL24] +

Dongting Li, Chenchong Tang, and Han Liu. Audio-llm: activating the capabilities of large language models to comprehend audio data. In Xinyi Le and Zhijun Zhang, editors, Advances in Neural Networks – ISNN 2024, 133–142. Singapore, 2024. Springer Nature Singapore.

+
+[WZL+24] +

Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F Chen. Audiobench: a universal benchmark for audio large language models. arXiv preprint arXiv:2406.16020, 2024.

+
+
[YWV+22]

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.

+
+[ZDY+24] +

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu-Gang Jiang, and Xipeng Qiu. AnyGPT: unified multimodal LLM with discrete sequence modeling. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 9637–9662. Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL: https://aclanthology.org/2024.acl-long.521, doi:10.18653/v1/2024.acl-long.521.

+
[ZIX+24]

Yixiao Zhang, Yukara Ikemiya, Gus Xia, Naoki Murata, Marco A Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji, and Simon Dixon. Musicmagus: zero-shot text-to-music editing via diffusion models. arXiv preprint arXiv:2402.06178, 2024.

+
+[ZZM+24a] +

Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, and Yuki Mitsufuji. Openmu: your swiss army knife for music understanding. arXiv preprint arXiv:2410.15573, 2024.

+
-[ZZM+24] +[ZZM+24b]

Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, and Yuki Mitsufuji. OpenMU: Your Swiss Army Knife for Music Understanding. October 2024. arXiv:2410.15573. URL: http://arxiv.org/abs/2410.15573 (visited on 2024-11-09).

diff --git a/description/datasets.html b/description/datasets.html index 9139ac8..b830ca3 100644 --- a/description/datasets.html +++ b/description/datasets.html @@ -439,7 +439,8 @@

Contents

Datasets

Below we review existing music datasets that contain natural language text accompanying music items. These can be music descriptions such as captions, or other types of text such as question-answer pairs and reviews. Note that in many cases the audio isn’t directly distributed with the dataset and may be subject to copyright.

-
+
+ @@ -565,7 +566,7 @@

Contents

Human-written text

-

Among the datasets containing music captions, only three feature fully human-written descriptions: MusicCaps [ADB+23], the Song Describer Dataset [MWD+23] and YouTube8M-MusicTextClips [MSSR23].

+

Among the datasets containing music captions, only three feature fully human-written descriptions: MusicCaps [ADB+23], the Song Describer Dataset [MWD+23] and YouTube8M-MusicTextClips [MSSR23].

Some examples of music captions from the SDD are shown below:

@@ -667,7 +668,7 @@

Human-written text

Synthetic Text

-

Datasets with human-provided captions, particularly MusicCaps and SDD, or tags often form the basis of other derived audio-text music datasets. Among these, some transform existing annotations by use of text templates (e.g. MusicBench [MGG+23]), or LLM-enabled augmentation (e.g. MusicQA [LHSS24], MusicInstruct [DML+24]) to obtain different kinds of text annotation such as more captions or question-answer pairs. In other cases, like in the MUCaps [HLSS23] and FUTGA [WNN+24] datasets, synthetic text annotations are instead produced based on the audio itself, by means of auxiliary audio captioning models.

+

Datasets with human-provided captions, particularly MusicCaps and SDD, or tags often form the basis of other derived audio-text music datasets. Among these, some transform existing annotations by use of text templates (e.g. MusicBench [MGG+23]), or LLM-enabled augmentation (e.g. MusicQA [LHSS24], MusicInstruct [DML+24]) to obtain different kinds of text annotation such as more captions or question-answer pairs. In other cases, like in the MUCaps [HLSS23] and FUTGA [WNN+24] datasets, synthetic text annotations are instead produced based on the audio itself, by means of auxiliary audio captioning models.

References

diff --git a/description/evaluation.html b/description/evaluation.html index b56501d..b1b2dc9 100644 --- a/description/evaluation.html +++ b/description/evaluation.html @@ -424,8 +424,11 @@

Contents

@@ -456,30 +459,40 @@

Match-based metrics +

Limitations

+

While a useful starting point in evaluating model outputs on more closed-ended tasks, these metrics are unable to capture all admissible variations in music description. For example, given a music track, there may be several possible captions that are equally valid but share very little in terms of syntactic or semantic similarity. Both in the music domain and in others such as general audio description, many studies have highlighted important limitations of these metrics, for example showing they fail to account for valid variations in captions and to align with human judgement [LL24]. For this reason, including human evaluation and task-specific benchmarks is necessary for a more well-rounded evaluation.

-
-

Other types of automatic evaluation

-
    -
  • Multiple-choice question answering: MuChoMusic [WMB+24]

  • -
  • Other benchmarks: OpenMU [ZZM+24]

  • -
  • LLM-as-a-judge

  • -
  • Non audio: [LYT+24]

  • -
+
+
+

Benchmarks

+

To overcome some of the shortcomings of match-based metrics, a few benchmarks have recently emerged with the goal of assessing music understanding or description via multiple-choice question-answering. These also better suit the conversational format of more recent music description systems, as they focus on assessing responses to specific user prompts (questions). Some benchmarks of this kind are designed for general audio-language evaluation and include music as part of a wider range of domains. Among these are AudioBench [WZL+24] and AIR-Bench [YXL+24]. Others, including MuChoMusic [WMB+24] and OpenMU [ZZM+24], directly focus on music:

+
+../_images/muchomusic.png +

References

-
+
-
-[LYT+24] -

Jiajia Li, Lu Yang, Mingni Tang, Cong Chen, Zuchao Li, Ping Wang, and Hai Zhao. The music maestro or the musically challenged, a massive music evaluation benchmark for large language models. arXiv preprint arXiv:2406.15885, 2024.

+
+[LL24] +

Jinwoo Lee and Kyogu Lee. Do captioning metrics reflect music semantic alignment? In International Society for Music Information Retrieval (ISMIR) 2024, Late Breaking Demo (LBD). 2024. URL: https://arxiv.org/abs/2411.11692.

-
-[WMB+24] +
+[WZL+24] +

Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F Chen. Audiobench: a universal benchmark for audio large language models. arXiv preprint arXiv:2406.16020, 2024.

+
+
+[WMB+24]

Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, and Dmitry Bogdanov. MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models. In 25th International Society for Music Information Retrieval Conference. August 2024. arXiv:2408.01337 [cs, eess]. URL: http://arxiv.org/abs/2408.01337 (visited on 2024-08-21), doi:10.48550/arXiv.2408.01337.

-
-[ZZM+24] +
+[YXL+24] +

Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, and Jingren Zhou. AIR-bench: benchmarking large audio-language models via generative comprehension. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1979–1998. Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL: https://aclanthology.org/2024.acl-long.109, doi:10.18653/v1/2024.acl-long.109.

+
+
+[ZZM+24]

Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, and Yuki Mitsufuji. OpenMU: Your Swiss Army Knife for Music Understanding. October 2024. arXiv:2410.15573. URL: http://arxiv.org/abs/2410.15573 (visited on 2024-11-09).

@@ -551,8 +564,11 @@

References

diff --git a/description/models.html b/description/models.html index 7d5c754..9ad2b7c 100644 --- a/description/models.html +++ b/description/models.html @@ -429,15 +429,8 @@

Contents

  • Conditioning and Fusion
  • -
  • Multimodal Autoregressive Models
  • Table 2 Music description datasets.#

    Dataset

    Content

    @@ -472,65 +465,83 @@

    Contents

    - + - + - + - + - + - + + + + + + + - - + + - + - + - + - - - - - + + + + + + + + + + + - + - - - + + + - + - - - + + + - + - - - + + + - + - - - + + + + + + + + +
    Table 1 Music description models.#

    Choi et al. [MBQF21]

    Choi et al. [MBQF21]

    Encoder-decoder

    Playlist captioning

    Captioning (playlist)

    Private

    Private data

    MusCaps [MBQF21]

    MusCaps [MBQF21]

    Encoder-decoder

    Captioning, retrieval*

    Private

    Private data

    LP-MusicCaps [MBQF21]

    PlayNTell [GHE22]

    Encoder-decoder

    Captioning (playlist)

    link

    PlayNTell

    LP-MusicCaps [MBQF21]

    Encoder-decoder

    Captioning

    link

    LP-MusicCaps

    MusCaps [MBQF21]

    ALCAP [HHL+23]

    Encoder-decoder

    Captioning, retrieval*

    Captioning

    Private

    Song Interpretation Dataset, NetEase Cloud Music Review Dataset

    BLAP [LPPW24]

    Encoder-decoder

    BLAP [LPPW24]

    Adapted LLM

    Captioning

    link

    Shutterstock (31k clips)

    LLark [GDSB23]

    Adapted LLM

    Captioning, MQA

    MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune

    MuLLama [LHSS24]

    MU-LLaMA [LHSS24]

    Adapted LLM

    Captioning, MQA

    link

    MusicQA

    MusiLingo [DML+24]

    MusiLingo [DML+24]

    Adapted LLM

    Captioning, MQA

    link

    MusicInstruct

    M2UGen[HLSS23]

    M2UGen[HLSS23]

    Adapted LLM

    Captioning, MQA, music generation

    link

    MUCaps, MUEdit

    LLark [GDSB23]

    OpenMU [ZZM+24]

    Adapted LLM

    Captioning, MQA

    link

    MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune

    FUTGA [WNN+24]

    Adapted LLM

    Captioning (fine-grained)

    link

    FUTGA

    -

    Encoder-Decoder Models#

    +

    Encoder-Decoder Models#

    This is the modelling framework of the earliest DL music captioning models. Encoder-decoder models first emerged in the context of sequence-to-sequence tasks (e.g. machine translation). It is easy to see many tasks can be cast as sequence-to-sequence, so encoder-decoder models found wide use in image captioning first, and audio captioning shortly after, including music.

    As the name suggests, models of this type are composed of two main modules: an encoder and a decoder. Although there are several variations, in the simplest design of these models, the encoder is responsible for processing the @@ -548,10 +559,10 @@

    Encoder-Decoder Models

    Architectures#

    When it comes to the design of the encoder and decoder components, the general philosophy is to adopt state-of-the-art architectures for the respective modalities, balancing our requirements around possible domain-specific restrictions (e.g. the need to capture features at different timescales in music signals), with the computational and data budget we have at our disposal. This is to say that there are many possible designs for encoder-decoder music captioners in theory, but most follow standard choices. Let’s review some below.

    -

    The first example of encoder-decoder model for music description appeared in work by Choi et al. [CFS16]. While this did not yet produce well-formed sentences, a later model by Manco et al., MusCaps [MBQF21], consolidated the use of a similar architecture for track-level music captioning. These early iterations of encoder-decoder music captioners employed CNN-based audio encoders alongside -RNN-based language decoders. More recent iterations of this framework typically make use of a Transformer-based language decoder (e.g. based on Transformer decoders such as GPT-2 [GHE22] or BART [DCLN23]), alongside CNNs [GHE22] or Transformer audio encoders [SCDBK24], and sometimes a hybrid of both [DCLN23].

    +

    The first example of an encoder-decoder model for music description appeared in work by Choi et al. [CFS16]. While this did not yet produce well-formed sentences, a later model by Manco et al., MusCaps [MBQF21], consolidated the use of a similar architecture for track-level music captioning. These early iterations of encoder-decoder music captioners employed CNN-based audio encoders alongside +RNN-based language decoders. More recent iterations of this framework typically make use of a Transformer-based language decoder (e.g. based on Transformer decoders such as GPT-2 [GHE22] or BART [DCLN23]), alongside CNNs [GHE22] or Transformer audio encoders [SCDBK24], and sometimes a hybrid of both [DCLN23].

    -description/img/encoder_decoder.png +../_images/encoder_decoder.png

    @@ -562,7 +573,7 @@

    Conditioning and Fusion

    In most cases, however, we deal with more sophisticated architectures, and conditioning is realised through fusion of audio and text representations. -Earlier models with RNN-based text decoders employ a range of fusion mechanisms, such as feature concatenation or cross-modal attention [MBQF21]. Concatenation as a modality fusion mechanism in RNNs typically consists of concatenating an audio embedding (e.g. the output of the encoder module \(\boldsymbol{a}\)) to the input \(\boldsymbol{x}\), so that an RNN state \(\boldsymbol{h}\) depends on \([\boldsymbol{a}; \boldsymbol{x}]\), or to the previous state vector \([\boldsymbol{a}; \boldsymbol{h}_{t-1}]\), and sometimes to both. In this case, we assume that the encoder produces a single audio embedding.

    +Earlier models with RNN-based text decoders employ a range of fusion mechanisms, such as feature concatenation or cross-modal attention [MBQF21]. Concatenation as a modality fusion mechanism in RNNs typically consists of concatenating an audio embedding (e.g. the output of the encoder module \(\boldsymbol{a}\)) to the input \(\boldsymbol{x}\), so that an RNN state \(\boldsymbol{h}\) depends on \([\boldsymbol{a}; \boldsymbol{x}]\), or to the previous state vector \([\boldsymbol{a}; \boldsymbol{h}_{t-1}]\), and sometimes to both. In this case, we assume that the encoder produces a single audio embedding.
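The input-level variant of concatenation fusion described above can be sketched as follows. This is an illustrative vanilla RNN in plain Python, not the implementation of any cited model: the update rule `h_t = tanh(W [a; x_t] + U h_{t-1} + b)`, the dimensions and the random initialisation are assumptions chosen to keep the example small.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step_concat(a, x_t, h_prev, W, U, b):
    """One RNN step where the audio embedding is concatenated to the input.

    Computes h_t = tanh(W [a; x_t] + U h_{t-1} + b).
    """
    z = a + x_t  # list concatenation plays the role of [a; x_t]
    pre = [wz + uh + bi for wz, uh, bi in zip(matvec(W, z), matvec(U, h_prev), b)]
    return [math.tanh(p) for p in pre]


rng = random.Random(0)
d_audio, d_token, d_hidden = 4, 3, 5  # hypothetical sizes


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


W = rand_matrix(d_hidden, d_audio + d_token)  # acts on the fused vector [a; x_t]
U = rand_matrix(d_hidden, d_hidden)           # acts on the previous state
b = [0.0] * d_hidden

a = [rng.uniform(-1, 1) for _ in range(d_audio)]  # single audio embedding from the encoder
tokens = [[rng.uniform(-1, 1) for _ in range(d_token)] for _ in range(6)]  # token embeddings

h = [0.0] * d_hidden
for x_t in tokens:
    # The same audio embedding is fused with the input at every time step.
    h = rnn_step_concat(a, x_t, h, W, U, b)
```

Because `a` is identical at every step, this form of conditioning cannot focus on different parts of the audio as the caption unfolds, which is the limitation that attention-based fusion addresses.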

    If our encoder produces instead a sequence of audio embeddings, and we wish to retain the sequential nature of the conditioning signal, an alternative way to achieve fusion is through cross-attention. In this case, instead of concatenating the same audio embedding at every time step \(t\), we can compute attention scores \(\beta_{t i}\) to suitably weigh each item in the audio sequence \(\boldsymbol{a}_i\) differently at each time step \(t\):

    \[ @@ -579,10 +590,10 @@

    Conditioning and Fusion

    where \(\boldsymbol{w}_{a t t}\) and \(\boldsymbol{W}^{a t t}\) are learnable parameters.

    -

    Similar types of attention-based fusion can also be used in Transformer-based architectures [GHE22] [DCLN23]. In this setting, instead of the cross-attention shown above, fusion can also be directly embedded within the Transformer blocks by modifying their self-attention mechanism to depend on both text and audio embeddings, though exact implementations of co-attentional Transformer layers vary between models:

    +

    Similar types of attention-based fusion can also be used in Transformer-based architectures [GHE22] [DCLN23]. In this setting, instead of the cross-attention shown above, fusion can also be directly embedded within the Transformer blocks by modifying their self-attention mechanism to depend on both text and audio embeddings, though exact implementations of co-attentional Transformer layers vary between models:

    \[ -\boldsymbol{A}\left(\boldsymbol{q}^{\text{text}}_{i}, \boldsymbol{K}^{\text{audio}}, \boldsymbol{V}^{\text{audio}}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{q}^{\text{text}}_{i} K^{\text{audio}}}{\sqrt{d_{k}}}\right) \boldsymbol{V}^{\text{audio}} +\boldsymbol{A}\left(\boldsymbol{q}^{\text{text}}_{i}, \boldsymbol{K}^{\text{audio}}, \boldsymbol{V}^{\text{audio}}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{q}^{\text{text}}_{i} {\boldsymbol{K}^{\text{audio}}}^{\top}}{\sqrt{d_{k}}}\right) \boldsymbol{V}^{\text{audio}}. \]
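As a rough illustration of this attention-based fusion, the sketch below computes scaled dot-product attention for a single text query over a short sequence of audio keys and values. It is a single-head, pure-Python toy under assumed dimensions, with the dot product against each key standing in for the product with the transposed key matrix; it does not reproduce the co-attentional layers of any specific model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q_text, K_audio, V_audio):
    """Attend from one text query to a sequence of audio keys/values.

    q_text:  query vector of size d_k
    K_audio: list of key vectors, one per audio frame, each of size d_k
    V_audio: list of value vectors, one per audio frame
    Returns the attended context vector and the attention weights.
    """
    d_k = len(q_text)
    scores = [
        sum(q * k for q, k in zip(q_text, key)) / math.sqrt(d_k)
        for key in K_audio
    ]
    weights = softmax(scores)
    d_v = len(V_audio[0])
    context = [sum(w * v[j] for w, v in zip(weights, V_audio)) for j in range(d_v)]
    return context, weights


# Toy inputs: three audio frames, keys of size 2, values of size 3.
q_text = [1.0, 0.0]
K_audio = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V_audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

context, weights = cross_attention(q_text, K_audio, V_audio)
```

The audio frame whose key is most aligned with the text query receives the largest weight, so each generated token can draw on a different part of the audio sequence.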
    ../_images/lp_musiccaps.png @@ -590,104 +601,127 @@

    Conditioning and Fusion

    In addition to the type of mechanism used, depending on the level at which modalities are combined, it is also common to distinguish between early (i.e. at the input level), intermediate (at the level of latent representations produced by an intermediate step in the overall processing pipeline) or late fusion (i.e. at the output level). We note that the terms early, intermediate and late fusion do not have an unequivocal definition and are used slightly differently in different works.

    -
    -

    Multimodal Autoregressive Models#

    -

    The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today’s state-of-the-art models rely on LLMs in one form or another. Typically, this means that music description systems closely mimic text-only autoregressive modelling via Transformers, but within this framework there are two main routes we can take. The first, and most common, is to adapt text-only LLMs so that they become multimodal by augmenting them with additional modelling components. A second option is to instead treat audio and text as sequences of tokens from the start, devising tokenization techniques and training on multiple modalities without additional modality-specific components. The line between these two approaches is not always clear. In the next section, we attempt to better define the salient characteristics of LLMs adapted to music-language inputs, and sketch out the newer trend towards natively multimodal models and its potential in music description.

    -

    Overall, a common thread in this line of work is the attempt to unify multimodal tasks by reframing all as text generation. When trained on music data, multimodal LLMs can therefore leverage their text-based interface to enable a variety of music understanding tasks by simply allowing users to query via text and obtain information about a given audio input. This is the machanism that enables the conversation-based music description tasks we have seen in the Tasks section.

    +
    +

    Multimodal AR Models#

    +

    The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today’s state-of-the-art models rely on LLMs in one form or another. Typically, this means that music description systems closely mimic text-only autoregressive modelling via Transformers, but within this framework there are two main routes we can take. The first, and most common, is to adapt text-only LLMs so that they become multimodal by augmenting them with additional modelling components. We call these adapted LLMs. A second option is to instead treat audio and text as sequences of tokens from the start, devising tokenization techniques and training on multiple modalities without additional modality-specific components. The line between these two approaches is not always clear. In the next section, we attempt to better define the salient characteristics of LLMs adapted to music-language inputs, and sketch out the newer trend towards natively multimodal models and its potential in music description.

    +

    Overall, a common thread in this line of work is the attempt to unify multimodal tasks by reframing all as text generation. When trained on music data, multimodal LLMs can therefore leverage their text-based interface to enable a variety of music understanding and description tasks by simply allowing users to query via text and obtain information about a given audio input. This is the mechanism that enables the conversation-based music description tasks we have seen in the Tasks section.

    -

    Adapted LLMs#

    +

    Adapted LLMs#

    One modelling paradigm that has become particularly popular in audio description, including music, is that of adapted (multimodal) LLMs. At the core of this approach is a pre-trained text-only LLM, which is adapted to take in inputs of different modalities such as audio. This is achieved via an adapter module, a light-weight neural network trained to map embeddings produced by an audio feature extractor (usually pre-trained and then frozen) to the input space of the LLM. As a result of this adaptation process, the LLM can then receive audio embeddings alongside text embeddings.

    ../_images/adapted.png
    -

    🚧

    -

    Alongside music-specialised multimodal LLMs, a LLM with general-audio understanding capabilities can similarly perform music description tasks such as captioning and MQA. Among these we count:

    +

    The architecture of the adapter modules employed in adapted LLMs for music typically consists of lightweight MLPs (between 2 and 3 hidden layers) or Q-Formers. Other architectures utilised in general audio adapted LLMs (or similar models in the visual domain) also include more complex designs such as Gated XATTN dense layers. This blog post about Visual Language Models reviews these in more detail.
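A minimal sketch of the MLP variant of an adapter is shown below. It assumes two hidden layers with ReLU activations, toy dimensions and random weights, and it simply prepends the projected audio frames to the text embeddings; real systems differ in layer sizes, activation functions and in how the two sequences are combined, so all names and values here are hypothetical.

```python
import random


def linear(x, W, b):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, d_in, d_out):
    """Randomly initialised (W, b) pair; stands in for trained adapter weights."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    b = [0.0] * d_out
    return W, b


def adapter(audio_frame, layers):
    """Project one audio feature vector into the LLM embedding space."""
    h = audio_frame
    for i, (W, b) in enumerate(layers):
        h = linear(h, W, b)
        if i < len(layers) - 1:  # no activation after the output layer
            h = relu(h)
    return h


rng = random.Random(0)
d_audio, d_hidden, d_llm = 6, 8, 4  # hypothetical sizes

# Two hidden layers followed by an output projection to the LLM dimension.
layers = [
    init_layer(rng, d_audio, d_hidden),
    init_layer(rng, d_hidden, d_hidden),
    init_layer(rng, d_hidden, d_llm),
]

# Outputs of a frozen audio feature extractor (3 frames) and text embeddings (2 tokens).
audio_frames = [[rng.uniform(-1, 1) for _ in range(d_audio)] for _ in range(3)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(d_llm)] for _ in range(2)]

# The LLM receives the adapted audio embeddings alongside the text embeddings.
adapted_audio = [adapter(frame, layers) for frame in audio_frames]
llm_inputs = adapted_audio + text_embeddings
```

In this setup only the adapter weights would need to be trained during the adaptation stage, while the audio feature extractor and, optionally, the LLM stay frozen.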

    +

    From the perspective of training, similarly to the text-only setting, training adapted LLMs is usually broken into several stages. After pre-training and finetuning of the text-only part, the remaining components undergo a series of multimodal training stages, while the backbone LLM is either kept frozen or further finetuned. These steps are usually a mixture of multi-task pre-training and supervised finetuning, often including instruction tuning, all carried out on pairs of audio and text data.

    +

    Alongside music-specialised multimodal LLMs such as those in Table 1, LLMs with general-audio understanding capabilities can similarly perform music description tasks such as captioning and MQA. Among these we count:

    -

    We don’t discuss these in detail, but their high-level design is similar to the music-specialised models we’ve seen in this section.

    -
    -

    Adapter Modules#

    -
    -
    -

    Training#

    -

    From the perspective of training, similarly to the text-only setting, training adapted LLMs is usually broken into several stages. After pre-training and finetuning of the text-only part, the remaining components undergo a series of multimodal training stages, while the backbone LLM is either kept frozen or further finetuned. These steps are usually a mixture of multi-task pre-training and supervised finetuning, often including instruction tuning, all carried out on pairs of audio and text data.

    -
    -
    Instruction Tuning#
    -
    -

    Natively Multimodal AR Models#

    -

    Other autoregressive Transformer models for music description share a similar core modelling mechanism to adapted LLM. But one key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo this in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start. +

    Adapted LLMs make it possible to turn text-only LLMs into multimodal models relatively efficiently: based on the models discussed in this section, around 20-150k audio-text paired samples are required to perform the adaptation stage of training, while multimodal pre-training would require orders of magnitude more data. However, this also limits their performance and often results in a bias towards the language modality and poor audio and music understanding capabilities [WMB+24]. An alternative that promises to overcome this limitation is to instead adopt a natively multimodal approach to AR modelling. One key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo this in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start. This paradigm is sometimes referred to as mixed-modal early-fusion modelling.
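One way to picture a unified tokenization scheme is to place discrete audio codes and text tokens in a single shared vocabulary, so that one autoregressive model can consume them as one sequence. The sketch below is a toy illustration of that idea only: the vocabulary sizes, the ID offset and the begin/end-of-audio markers are assumptions and do not describe the tokenizer of AnyGPT or any other model.

```python
TEXT_VOCAB_SIZE = 1000      # hypothetical size of the text vocabulary
AUDIO_CODEBOOK_SIZE = 256   # hypothetical number of discrete audio codes

# Hypothetical special tokens marking where the audio segment starts and ends.
BOA = TEXT_VOCAB_SIZE + AUDIO_CODEBOOK_SIZE
EOA = BOA + 1


def to_shared_ids(audio_codes, text_ids):
    """Build one token sequence from audio codes and text token IDs.

    Audio codes are shifted past the text vocabulary so the two modalities
    never collide in the shared ID space.
    """
    for code in audio_codes:
        if not 0 <= code < AUDIO_CODEBOOK_SIZE:
            raise ValueError(f"audio code out of range: {code}")
    for tid in text_ids:
        if not 0 <= tid < TEXT_VOCAB_SIZE:
            raise ValueError(f"text token out of range: {tid}")
    audio_ids = [TEXT_VOCAB_SIZE + code for code in audio_codes]
    return [BOA] + audio_ids + [EOA] + list(text_ids)


def modality_of(token_id):
    """Recover which part of the shared vocabulary a token ID belongs to."""
    if token_id < TEXT_VOCAB_SIZE:
        return "text"
    if token_id < TEXT_VOCAB_SIZE + AUDIO_CODEBOOK_SIZE:
        return "audio"
    return "special"


# Two audio codes (e.g. from a neural audio codec) followed by two text tokens.
sequence = to_shared_ids([3, 255], [17, 42])
```

With this layout no separate audio encoder or adapter is involved: the same embedding table and next-token objective cover both modalities, which is what distinguishes this approach from adapted LLMs.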

    -

    It’s worth noting that, at this time, this type of model is a promising direction for music description, rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modalling paradigm will be on music description in the years to come. Among current examples of this type of model that include music description we have:

    - +

    It’s worth noting that, at this time, this type of model is a promising direction for music description, rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models, such as AnyGPT [ZDY+24], include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modelling paradigm will be on music description in the years to come.

    References#

    -
    +
    -
    -[CFS16] +
    +[CFS16]

    Keunwoo Choi, George Fazekas, and Mark Sandler. Towards music captioning: generating music playlist descriptions. In Extended abstracts for the Late-Breaking Demo Session of the 17th International Society for Music Information Retrieval Conference. 08 2016. doi:10.48550/arXiv.1608.04868.

    -
    -[DML+24] +
    +[CXZ+23] +

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models. December 2023. arXiv:2311.07919 [cs, eess]. URL: http://arxiv.org/abs/2311.07919 (visited on 2024-02-26), doi:10.48550/arXiv.2311.07919.

    +
    +
    +[DML+24]

    Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, and Emmanouil Benetos. MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, 3643–3655. Mexico City, Mexico, June 2024. Association for Computational Linguistics. URL: https://aclanthology.org/2024.findings-naacl.231 (visited on 2024-07-04).

    -
    -[DESW23] +
    +[DESW23]

    Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang. Pengi: An Audio Language Model for Audio Tasks. In Thirty-seventh Conference on Neural Information Processing Systems. 2023. arXiv:2305.11834 [cs, eess]. URL: http://arxiv.org/abs/2305.11834 (visited on 2024-02-16), doi:10.48550/arXiv.2305.11834.

    -
    +
    [DCLN23] -(1,2,3,4) +(1,2,3)

    SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. Lp-musiccaps: llm-based pseudo music captioning. In International Society for Music Information Retrieval (ISMIR). 2023.

    -
    +
    [GHE22] -(1,2,3) +(1,2,3,4)

    Giovanni Gabbolini, Romain Hennequin, and Elena Epure. Data-efficient playlist captioning with musical and linguistic knowledge. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 11401–11415. Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL: https://aclanthology.org/2022.emnlp-main.784, doi:10.18653/v1/2022.emnlp-main.784.

    -
    -[GDSB23] +
    +[GDSB23]

    Josh Gardner, Simon Durand, Daniel Stoller, and Rachel M Bittner. Llark: a multimodal foundation model for music. arXiv preprint arXiv:2310.07160, 2023.

    -
    -[HLSS23] +
    +[GLL+23] +

    Yuan Gong, Hongyin Luo, Alexander H Liu, Leonid Karlinsky, and James Glass. Listen, think, and understand. arXiv preprint arXiv:2305.10790, 2023.

    +
    +
    +[HHL+23] +

    Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman, and Xuchen Song. Alcap: alignment-augmented music captioner. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 16501–16512. 2023.

    +
    +
    +[HLSS23]

    Atin Sakkeer Hussain, Shansong Liu, Chenshuo Sun, and Ying Shan. M2UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models. arXiv preprint arXiv:2311.11255, 2023.

    -
    -[LPPW24] +
    +[KGB+24] +

    Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: a novel audio language model with few-shot learning and dialogue abilities. In ICML. 2024. URL: https://openreview.net/forum?id=WYi3WKZjYe.

    +
    +
    +[LPPW24]

    Luca A Lanzendorfer, Nathanal Perraudin, Constantin Pinkl, and Roger Wattenhofer. BLAP: Bootstrapping Language-Audio Pre-training for Music Captioning. In Workshop on AI-Driven Speech, Music, and Sound Generation. 2024.

    -
    -[LHSS24] +
    +[LTL24] +

    Dongting Li, Chenchong Tang, and Han Liu. Audio-llm: activating the capabilities of large language models to comprehend audio data. In Xinyi Le and Zhijun Zhang, editors, Advances in Neural Networks – ISNN 2024, 133–142. Singapore, 2024. Springer Nature Singapore.

    +
    +
    +[LHSS24]

    Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, and Ying Shan. Music understanding llama: advancing text-to-music generation with question answering and captioning. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 286–290. 2024. doi:10.1109/ICASSP48485.2024.10447027.

    -
    +
    [MBQF21] -(1,2,3,4,5,6) +(1,2,3,4,5)

    Ilaria Manco, Emmanouil Benetos, Elio Quinton, and György Fazekas. Muscaps: generating captions for music audio. In 2021 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE, 2021.

    -
    -[SCDBK24] +
    +[SCDBK24]

    Nikita Srivatsan, Ke Chen, Shlomo Dubnov, and Taylor Berg-Kirkpatrick. Retrieval guided music captioning via multimodal prefixes. In Kate Larson, editor, Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, 7762–7770. International Joint Conferences on Artificial Intelligence Organization, 8 2024. AI, Arts & Creativity. URL: https://doi.org/10.24963/ijcai.2024/859, doi:10.24963/ijcai.2024/859.

    -
    -[TYS+23] +
    +[TYS+23]

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. SALMONN: Towards Generic Hearing Abilities for Large Language Models. In The Twelfth International Conference on Learning Representations. October 2023. URL: https://openreview.net/forum?id=14rn7HpKVk (visited on 2024-02-22).

    +
    +[WMB+24] +

    Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, and Dmitry Bogdanov. MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models. In 25th International Society for Music Information Retrieval Conference. August 2024. arXiv:2408.01337 [cs, eess]. URL: http://arxiv.org/abs/2408.01337 (visited on 2024-08-21), doi:10.48550/arXiv.2408.01337.

    +
    +
    +[WNN+24] +

    Junda Wu, Zachary Novack, Amit Namburi, Jiaheng Dai, Hao-Wen Dong, Zhouhang Xie, Carol Chen, and Julian McAuley. Futga: towards fine-grained music understanding through temporally-enhanced generative augmentation. arXiv preprint arXiv:2407.20445, 2024.

    +
    +
    +[ZDY+24] +

    Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu-Gang Jiang, and Xipeng Qiu. AnyGPT: unified multimodal LLM with discrete sequence modeling. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 9637–9662. Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL: https://aclanthology.org/2024.acl-long.521, doi:10.18653/v1/2024.acl-long.521.

    +
    +
    +[ZZM+24] +

    Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, and Yuki Mitsufuji. Openmu: your swiss army knife for music understanding. arXiv preprint arXiv:2410.15573, 2024.

    +
    @@ -762,15 +796,8 @@

    ReferencesConditioning and Fusion -
  • Multimodal Autoregressive Models
  • Conversational Music Description#

    @@ -500,7 +500,7 @@

    Conversational Music Description

    ../_images/dialogue.png

    A key difference between dialogue-based description and one-off captioning is that, instead of an audio --> text mapping, we are now dealing with an (audio, text) --> text mapping. This is reflected in the different model designs typically considered for these tasks (see Models). Differently from simple MQA, in music dialogue generation, responses are expected to be based on the entire dialogue history instead of only considering the current input.

    -

    In terms of real-world applications, the advantages of dialogue-based description are clear: instead of being constrained to a one-shot caption or answer, it allows users to provide text inputs to further instruct the model on what kind of information should be included, or how the text output itself should be structured. In short, these tasks make for a much more flexible approach which better reflects real-world use. One drawback is that they are harder to evaluate (see Evaluation)!

    +

    In terms of real-world applications, the advantages of dialogue-based description are clear: instead of being constrained to a one-shot caption or answer, it allows users to provide text inputs to further instruct the model on what kind of information should be included, or how the text output itself should be structured. In short, these tasks make for a much more flexible approach which better reflects real-world use. One drawback is that they are harder to evaluate (see Evaluation)!

    References#

    diff --git a/objects.inv b/objects.inv index 2773b8f..c8a4e60 100644 Binary files a/objects.inv and b/objects.inv differ diff --git a/searchindex.js b/searchindex.js index 7db8f89..b044fb4 100644 --- a/searchindex.js +++ b/searchindex.js @@ -1 +1 @@ -Search.setIndex({"alltitles": {"1. Natural Langauge is (almost) universal label (y), task (z) encoder.": [[16, "natural-langauge-is-almost-universal-label-y-task-z-encoder"]], "2. Natural Langauge is (weak but scalable) supervision for representation learning": [[16, "natural-langauge-is-weak-but-scalable-supervision-for-representation-learning"]], "3. Natural Langauge is Human Friendly interface.": [[16, "natural-langauge-is-human-friendly-interface"]], "About the Authors": [[15, "about-the-authors"]], "Abstract Musical Controls": [[2, "abstract-musical-controls"]], "Adapted LLMs": [[8, "adapted-llms"]], "Adapter Modules": [[8, "adapter-modules"]], "Adaptive Modulation/Normalization": [[21, "adaptive-modulation-normalization"]], "Advances": [[19, null]], "Aligning Language Models with Human Feedback": [[19, "aligning-language-models-with-human-feedback"]], "Apply Text Augmentation Techniques": [[28, "apply-text-augmentation-techniques"]], "Architecture": [[11, "architecture"]], "Architectures": [[8, "architectures"]], "Audio Diversity and Quality": [[12, "audio-diversity-and-quality"]], "Audio-Sentence Joint Embedding": [[28, "audio-sentence-joint-embedding"]], "Audio-Tag Joint Embedding": [[28, "audio-tag-joint-embedding"]], "Audio2Audio Controls": [[2, "audio2audio-controls"]], "Autoregressive Language Models": [[21, "autoregressive-language-models"]], "Background": [[17, null]], "Beyond Audio Modality": [[1, null]], "Beyond Text-Based Interactions": [[2, null]], "Beyond semntica attributes, toward handle similarity queries": [[28, "beyond-semntica-attributes-toward-handle-similarity-queries"]], "Bibliography": [[0, null]], "Chain-of-Thought Reasoning of Language Models": [[19, 
"chain-of-thought-reasoning-of-language-models"]], "Challenges": [[20, null], [23, null]], "Channel Concatenation": [[21, "channel-concatenation"]], "Code Practice": [[4, null], [24, null]], "Code Tutorial": [[10, null]], "Conclusion": [[3, null]], "Conclusion \ud83c\udf89": [[4, "conclusion"], [24, "conclusion"]], "Conditioning": [[11, "conditioning"], [21, "conditioning"]], "Conditioning and Fusion": [[8, "conditioning-and-fusion"]], "Connecting Music Audio and Natural Language": [[15, null]], "Conversational Music Description": [[9, "conversational-music-description"]], "Conversational Retrieval": [[25, null]], "Datasets": [[5, null]], "Diffusion Model-based Text-to-Music Generation": [[11, null]], "Diffusion: Continuous Generation through Iterative Refinement": [[11, "diffusion-continuous-generation-through-iterative-refinement"]], "Distillation of Language Models": [[19, "distillation-of-language-models"]], "Early Stage Retrieval Methods": [[27, "early-stage-retrieval-methods"]], "Early Stage of Music Annotation and Retrieval": [[17, "early-stage-of-music-annotation-and-retrieval"]], "Early Stage of Music Generation": [[17, "early-stage-of-music-generation"]], "Employ Strategic Negative Sampling": [[28, "employ-strategic-negative-sampling"]], "Encoder-Decoder Attention (a.k.a. 
Cross Attention)": [[21, "encoder-decoder-attention-a-k-a-cross-attention"]], "Encoder-Decoder Models": [[8, "encoder-decoder-models"]], "Evaluation": [[6, null], [12, null], [26, null]], "Fr\u00e9chet Inception Distance (FID/FAD)": [[12, "frechet-inception-distance-fid-fad"]], "Future Directions": [[25, "future-directions"]], "Getting Started": [[15, "getting-started"]], "History": [[13, "history"]], "Human-written text": [[5, "human-written-text"]], "Implementing Language Models": [[21, "implementing-language-models"]], "Inception Score": [[12, "inception-score"]], "Inference & Make Retrieval Engine": [[24, "inference-make-retrieval-engine"]], "Initialize with Pre-trained Models": [[28, "initialize-with-pre-trained-models"]], "Instruction Tuning": [[8, "instruction-tuning"]], "Introduction": [[4, "introduction"], [7, null], [13, null], [22, null], [24, "introduction"], [27, null]], "Key Benefits of Conversational Retrieval": [[25, "key-benefits-of-conversational-retrieval"]], "Key Technical Challenges": [[25, "key-technical-challenges"]], "Langauge Models": [[18, "langauge-models"]], "Language Models as a Framework": [[21, "language-models-as-a-framework"]], "Let\u2019s Get Started! 
\ud83d\ude80": [[4, "let-s-get-started"], [24, "let-s-get-started"]], "Leverage Diverse Training Data Sources": [[28, "leverage-diverse-training-data-sources"]], "Limitation": [[12, "limitation"]], "Limitations": [[19, "limitations"]], "Limitations of Single-Turn Systems": [[25, "limitations-of-single-turn-systems"]], "Listening Test": [[12, "listening-test"]], "MOS Test (Mean Opinion Score)": [[12, "mos-test-mean-opinion-score"]], "MUSHRA Test (Multiple Stimuli with Hidden Reference and Anchor)": [[12, "mushra-test-multiple-stimuli-with-hidden-reference-and-anchor"]], "Masked Language Models": [[21, "masked-language-models"]], "Match-based metrics": [[6, "match-based-metrics"]], "Metric Learning Loss Functions": [[28, "metric-learning-loss-functions"]], "Models": [[8, null], [28, null], [28, "id3"]], "Motivation & Aims": [[15, "motivation-aims"]], "Multi-modal Joint Embedding Model Architecture": [[28, "multi-modal-joint-embedding-model-architecture"]], "Multimodal Autoregressive Models": [[8, "multimodal-autoregressive-models"]], "Multimodal Decoders for Language Model Outputs": [[19, "multimodal-decoders-for-language-model-outputs"]], "Multimodal Encoders for Language Model Inputs": [[19, "multimodal-encoders-for-language-model-inputs"]], "Music Captioning": [[9, "music-captioning"]], "Music Classification": [[9, "music-classification"]], "Music Description": [[18, "music-description"]], "Music Generation": [[18, "music-generation"]], "Music Question Answering": [[9, "music-question-answering"]], "Music Retrieval": [[18, "music-retrieval"]], "Music description models.": [[8, "description-models-table"]], "MusicGEN": [[14, null], [14, "id4"]], "Natively Multimodal AR Models": [[8, "natively-multimodal-ar-models"]], "Neural Audio Codec": [[14, "neural-audio-codec"]], "Other types of automatic evaluation": [[6, "other-types-of-automatic-evaluation"]], "Overview of Tutorial": [[18, null]], "Overview of this tutorial section": [[7, 
"overview-of-this-tutorial-section"]], "Performance & Efficiency": [[20, "performance-efficiency"]], "Precision and Recall": [[26, "precision-and-recall"]], "Prefix Conditioning": [[21, "prefix-conditioning"]], "Prerequisites": [[4, "prerequisites"], [24, "prerequisites"]], "Problem Definition": [[13, "problem-definition"]], "Problem: Out of Vocabulary": [[27, "problem-out-of-vocabulary"]], "Query-Caption Distribution Mismatch": [[23, "query-caption-distribution-mismatch"]], "References": [[5, "references"], [6, "references"], [8, "references"], [9, "references"], [17, "references"], [18, "references"], [23, "references"], [25, "references"], [27, "references"], [28, "references"]], "Representation": [[11, "representation"]], "Representation: Text as Sequence of Tokens": [[21, "representation-text-as-sequence-of-tokens"]], "Resources for Further Learning \ud83d\udcda": [[24, "resources-for-further-learning"]], "Resources \ud83d\udcda": [[4, "resources"]], "Results \ud83d\udcc8": [[4, "results"]], "Retrieval-Augmented Generation (RAG)": [[19, "retrieval-augmented-generation-rag"]], "Scaling Laws of Language Models": [[19, "scaling-laws-of-language-models"]], "Single-Turn Retrieval Limitations": [[23, "single-turn-retrieval-limitations"]], "Stable Audio Open Tutorial": [[10, "stable-audio-open-tutorial"]], "Step 1: Setting Up Our Environment": [[24, "step-1-setting-up-our-environment"]], "Step 1: Setting up our environment": [[4, "step-1-setting-up-our-environment"]], "Step 2: Loading the data \ud83d\udcca": [[4, "step-2-loading-the-data"]], "Step 2: Understanding the Data \ud83d\udcca": [[24, "step-2-understanding-the-data"]], "Step 3: Creating Our Dataset Class \ud83c\udfa8": [[24, "step-3-creating-our-dataset-class"]], "Step 3: Creating our dataset class \ud83c\udfa8": [[4, "step-3-creating-our-dataset-class"]], "Step 4: Building & Training Our Model Architecture \ud83c\udfd7\ufe0f": [[24, "step-4-building-training-our-model-architecture"]], "Step 4: Building and 
training our model \ud83c\udfd7\ufe0f": [[4, "step-4-building-and-training-our-model"]], "Synthetic Text": [[5, "synthetic-text"]], "Tasks": [[9, null]], "Text Relevance": [[12, "text-relevance"]], "The Framework": [[21, null]], "The axes of music description": [[7, "the-axes-of-music-description"]], "Tips for Training Audio-Text Joint Embedding Models": [[28, "tips-for-training-audio-text-joint-embedding-models"]], "Tool Use and Function Calling": [[19, "tool-use-and-function-calling"]], "Training": [[8, "training"]], "Transfer Learning from Language Models": [[19, "transfer-learning-from-language-models"]], "Trust & Safety": [[20, "trust-safety"]], "Types of music captioning": [[9, "types-of-music-captioning"]], "What We\u2019ll Build": [[24, "what-we-ll-build"]], "What are language models?": [[22, "what-are-language-models"]], "What is music description? And why do we need it?": [[7, "what-is-music-description-and-why-do-we-need-it"]], "What is the Benefit of Joint Embedding?": [[28, "what-is-the-benefit-of-joint-embedding"]], "What we will build": [[4, "what-we-will-build"]], "Why Natural Langauge?": [[16, null]], "Zero-shot Task Transfer and In-Context Learning": [[19, "zero-shot-task-transfer-and-in-context-learning"]]}, "docnames": ["bibliography", "conclusion/beyondaudio", "conclusion/beyondtext", "conclusion/intro", "description/code", "description/datasets", "description/evaluation", "description/intro", "description/models", "description/tasks", "generation/code", "generation/diffusionmodel", "generation/evaluation", "generation/intro", "generation/lmmodel", "intro", "introduction/advantange", "introduction/background", "introduction/overview", "lm/advances", "lm/challenges", "lm/framework", "lm/intro", "retrieval/challenge", "retrieval/code", "retrieval/conversational_retrieval", "retrieval/evaluate", "retrieval/intro", "retrieval/models"], "envversion": {"sphinx": 62, "sphinx.domains.c": 3, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, 
"sphinx.domains.cpp": 9, "sphinx.domains.index": 1, "sphinx.domains.javascript": 3, "sphinx.domains.math": 2, "sphinx.domains.python": 4, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx.ext.intersphinx": 1, "sphinxcontrib.bibtex": 9}, "filenames": ["bibliography.md", "conclusion/beyondaudio.md", "conclusion/beyondtext.md", "conclusion/intro.md", "description/code.ipynb", "description/datasets.ipynb", "description/evaluation.md", "description/intro.md", "description/models.md", "description/tasks.md", "generation/code.ipynb", "generation/diffusionmodel.md", "generation/evaluation.md", "generation/intro.md", "generation/lmmodel.md", "intro.md", "introduction/advantange.ipynb", "introduction/background.md", "introduction/overview.md", "lm/advances.md", "lm/challenges.md", "lm/framework.md", "lm/intro.md", "retrieval/challenge.md", "retrieval/code.ipynb", "retrieval/conversational_retrieval.md", "retrieval/evaluate.md", "retrieval/intro.md", "retrieval/models.md"], "indexentries": {}, "objects": {}, "objnames": {}, "objtypes": {}, "terms": {"": [0, 2, 6, 7, 8, 9, 10, 11, 15, 16, 17, 18, 19, 20, 21, 22, 23, 27, 28], "0": [4, 5, 9, 10, 11, 12, 16, 24, 26, 28], "00": [], "000": 22, "000061": 10, "00006103515625": 10, "00341": [0, 18], "0050b2820a1e709ffa623f9a9e8ae42d0903535f2150613cbfeb7f16932a": [], "00512": [], "0083": 24, "00830": [], "0092": 4, "00927": [], "01": [0, 9], "01095": [], "01103": 0, "01324": [], "01337": [0, 6], "01420": [0, 5], "01546": [], "01618": [], "01626": [], "01652": [0, 18], "01733": [], "01840": [], "019": [], "01917": 0, "02": [0, 4, 8, 24], "021c1d407befb505791764ad2cbd56ceaaa53a746baed01d2e2143f05f18": [], "02252": 0, "02257": 0, "02696": [], "03": [0, 9], "03458": [], "03499": [0, 17], "03739": [], "03748": [0, 28], "03917": [], "04": [0, 5, 8, 9], "04208": [], "04378": [], "04628": [], "04658": [], "04805": [0, 18, 28], "04868": [0, 8], "05": [], "05011": 0, "05224": [], 
"056d58b606731f94fe395266c604ea9efcecc10e6857ceb9b10e6831d746": [], "0577": 4, "0583": 4, "0586": 4, "0595336914c5619e5f28a1fb793285925a8cd4b432c9da0a987836c7f822": [], "05967": [], "06": [0, 9], "06125": [], "06174": [], "06178": 0, "0686": [], "07": [0, 5, 8, 9, 24], "0702": 4, "07069": [0, 18], "07160": [0, 8, 9, 18], "07439": [0, 23, 25], "07724": [], "07837": [0, 17], "07848": [], "0791": 4, "07919": [], "08": [0, 6, 8], "08070": [], "08384": 0, "08466": [], "08667": 0, "08691": [], "08774": [0, 18], "08803": [], "09": [0, 6], "0933": 4, "09636": [], "0984": 4, "0a": [], "0a0": [], "0a1": [], "0b": [], "0da8e798b168": 4, "0dfc83e0fe455cfe6272b23a65039b4101c63a4e7446801e26178b675fbf": [], "0ea5e3611e0b63766a56f81e7bc5cfa05c52e3a3f0b8d66b25c7262aeda": [], "0m": [], "1": [0, 2, 5, 6, 8, 9, 10, 11, 12, 15, 18, 19, 26, 28], "10": [0, 4, 5, 6, 8, 9, 10, 15, 24, 28], "100": [4, 12, 24], "1000": 10, "10057": [0, 5, 23], "10191775": [0, 9], "1024": [4, 24], "1025": [0, 23], "10301": [], "1032d0dbc2152c45f3d1e582a72e68f41898de9665202392d9400dfa329d": [], "1038": [], "10447027": [0, 5, 8], "1045": [0, 23], "104e9f575c27679ffedf994e53e6ac39067a0e77b2ea0d1567d4738686": [], "106": [], "1068": 0, "1076": [0, 9], "1077": 0, "10789": [], "10828fb40dcf097d1af84c1f2f863bae4046d5949450bf95b3260f767672": [], "10970": 0, "10f97f73544edcdef54409f1d839f6049a0d79df68adbc1ceb24d1aaca42": [], "11": [0, 6], "1109": [0, 5, 8, 9], "1116": 0, "1120": 0, "11255": [0, 5, 8], "11305": 0, "11315": 0, "11325": [0, 5, 18], "113k": 5, "114": [], "11401": [0, 8, 9], "11415": [0, 8, 9], "1141a8232723dcb10a595cc0ce4321dcbbd5215300bf4acfc142343205bf": [], "1146": 24, "11489": [0, 25], "11498": [0, 28], "114m": [], "115": [], "1165": 4, "11692": [0, 28], "11757": [], "1180": 0, "11834": [0, 8], "1186": [], "1188": 0, "11994": [], "11k": 5, "12": [0, 16], "12015": [], "1208": [0, 9], "120bpm": 16, "121": [], "1212": [0, 9], "12179": [], "12207897848a653d03ebbf6775a29d949408ded5f99b2d87198bc5c93508": [], 
"12208": [0, 18, 28], "12415": [0, 18, 28], "125": 0, "125817600": 4, "12661": [], "12662": 0, "1267": 4, "128": [4, 24], "12839": [], "13": [0, 16], "130bpm": 16, "13218": [], "13301": [], "13438": 0, "13569": [0, 28], "1362": 0, "13686": [], "1371": 0, "13731": [], "14": [4, 15], "140": [0, 10, 18], "1412": [], "14167": [], "1426": 4, "14358": 0, "1446": 4, "14784": [0, 5], "14793": [0, 5], "1481": 4, "14867": [], "149": [], "14rn7hpkvk": [0, 8], "15": 28, "150": 16, "15018": [], "1514580907b0bac0970415e5e24ef96a9c1fa71dcf2aa0139045b58fae9a": [], "1534": 0, "15573": [0, 6], "156": [], "15885": [0, 6], "16": [0, 8, 12, 13, 16, 17, 18, 27], "1601": [4, 24], "1604": [], "1608": [0, 8], "1609": [0, 17], "1612": [0, 17], "162": 0, "163": [], "16322": [], "16372": [], "16501": [0, 9], "16512": [0, 9], "1679": 24, "16798": [0, 9], "17": [0, 12, 13, 14, 17], "17042": [], "17162": [], "173": 0, "179": [], "179dd1bf8fd6bd689f0907f4baed557d2b12d2cf3d7ed1a8ecefe0a63d83": [], "17a": 0, "17b": [0, 13], "17th": [0, 8], "18": [0, 13, 17, 18], "1802": [], "1805": [], "1807": [0, 28], "1810": [0, 18, 28], "1812": [], "18407": [], "18503": [], "18653": [0, 8, 9], "1869": 4, "1874": 4, "18754": 24, "18828": [], "18th": 0, "19": [0, 13, 18], "1907": [0, 28], "19159": [], "1937": [], "194": 0, "1950": [], "19512": [], "1964": [], "1970": 13, "1975": [], "1979": 0, "1982": 0, "1983": 0, "1989": 0, "1990": 13, "1992": [], "19d5ff584cb58f654d22d8d6552d7c2fff7b85e4a9d525357f62a4d1e7e0": [], "1a": [], "1b69b697fe067d51219cfd64d0712bcbbce3b187389cb0793d9844ec14b1": [], "1bdb57a072903b222b1a745aa634cb845ff5f52a88ddd5ed1640ecf30beb": [], "1c": [], "1d": [11, 14], "1e": [4, 24], "1f": [], "1f0a22a6bcdd3fc26c73f63a025d05bd565901b729d56bcb093c722a6c4c": [], "1k": [], "1m": 11, "2": [0, 2, 3, 5, 6, 8, 10, 11, 15, 17, 18, 19, 27], "20": [0, 12, 13, 15, 18], "200": 27, "2000": 17, "2001": 0, "2002": [0, 9], "2003": [0, 9], "2005": [0, 17, 18, 27], "2007": [0, 17], "2008": [0, 17, 18, 27], "2009": 
[], "2010": [0, 9, 17, 23], "2012": [], "2013": [], "2014": [], "2015": 13, "2016": [0, 8, 17], "2017": [0, 9, 17], "2018": [0, 13, 17, 18, 28], "2019": [0, 17, 18, 28], "202": 0, "2020": [0, 13, 18], "2021": [0, 8, 9, 11, 15, 18, 28], "2022": [0, 8, 9, 18, 28], "2023": [0, 5, 8, 9, 18, 23, 25, 28], "2024": [0, 5, 6, 8, 9, 15, 18, 23, 25, 28], "20445": [0, 5, 9], "207": [], "20a": 0, "20b": [0, 13], "20xx": [], "21": [0, 6, 9, 11, 16, 18, 28], "2104": [], "2109": [0, 18], "2110": [], "2111": 0, "214": [], "21450": 0, "21474": 0, "2161": [0, 9], "21783": [], "21th": 0, "22": [0, 8, 13, 16, 18, 28], "2202": [], "2204": [], "2205": 0, "22050": [4, 24], "2206": [], "2208": [0, 18, 28], "2210": 0, "2211": [], "2226": 0, "2231": 4, "2234": 0, "22a": [], "22b": [], "22k": 5, "23": [0, 2, 5, 8, 9, 11, 12, 13, 18, 23, 25, 28], "2301": [0, 5, 18, 25], "2302": 0, "2303": [0, 18], "2304": [], "2305": [0, 8], "2307": [], "2308": [], "231": [0, 5, 8, 9], "2310": [0, 8, 9, 18], "2311": [0, 5, 8, 18, 23], "2312": [], "2344469e2084fb287c2e0b57b72910309874c3245463acd6cf5e3db69324": [], "2350": 0, "2354": 0, "2358": 4, "237m": [], "238": 0, "2392": [0, 9], "2396": [0, 9], "23a": [0, 11], "23b": [0, 13], "23ef2fd02913d65d43dc7516fc829af709314a66c6f0bdc2e361fdcecc2d": [], "24": [0, 2, 5, 6, 8, 9, 11, 13, 14, 18, 23, 25], "2401": [], "2402": 0, "2403": [], "2404": [0, 28], "2405": [], "2406": [0, 6], "2407": [0, 5, 9], "2408": [0, 6], "2409": [0, 28], "2410": [0, 6], "2411": [0, 23, 25], "249": [], "24963": [0, 8], "24a": [], "24b": [], "24th": 0, "25": [0, 18], "25bcf75e373412daf1fd88045ab3aa8140a0d804ef0e70712c4f2c5b94d8": [], "25h": [], "25hcollect": [], "25hdownload": [], "25hrequir": [], "25l": [], "25th": [0, 6, 15], "26": [], "26045404a30c8a200e960fb54fbaf4b73d12e58cd28e03b306b084253f4f": [], "262145": 24, "263": [], "264k": 5, "265": 10, "266": [], "27": [], "2713830": [0, 9], "273186269": 0, "2754": [], "2764": [], "2788": 24, "28": 0, "28492": [], "28518": [], "286": [0, 5, 
8], "287": [], "2880": 0, "2894": 0, "28k": 5, "28th": 0, "29": 5, "290": [0, 5, 8], "2919": 24, "293": [0, 9], "2971": 24, "2a": [], "2a3e3df732393fed8b3ebf2ec078f05546de641fe1b667ee316ec1dcf3b7": [], "2b": [], "2c": [], "2d": 11, "2d1c0ebfd092e25935b86509a9a817159212d82aa43d7fb07eca4eeff2c2": [], "2d231b35456506b7c98b3ab9bbf07917b205fed8615d2e59e976ab497fff": [], "2d512efdb0de203d1f0312fae53433c3009ba70b0078421d25baaedc960a": [], "2e": [], "2eb3cd785efd67806c46c13a17339708ddc346cbb684eade7a6e6f79536a": [], "2f": [], "2k": 5, "2m": 5, "2min": 5, "2ugen": [], "3": [0, 3, 5, 6, 9, 10, 15, 17, 18, 19], "30": [5, 9, 10, 11, 22], "300": [4, 24], "302": [0, 9], "30aa32745af16af0a9a650115fbe81bde7c610ed5c21b381fca0196f3a7f": [], "31": [], "3122": 4, "313": 0, "3169": [], "317": [], "31884": [4, 24], "319": [], "31m": [], "31m1": [], "31m10": [], "31m108": [], "31m11": [], "31m12": [], "31m122": [], "31m13": [], "31m14": [], "31m141": [], "31m15": [], "31m16": [], "31m17": [], "31m172": [], "31m191": [], "31m2": [], "31m3": [], "31m4": [], "31m470": [], "31m493": [], "31m5": [], "31m6": [], "31m7": [], "31m742": [], "31m768": [], "31m796": [], "31m8": [], "31m834": [], "31m836": [], "31m837": [], "31m839": [], "31m845": [], "31m848": [], "31m849": [], "31m85": [], "31m853": [], "31m855": [], "31m860": [], "31m861": [], "31m868": [], "31m872": [], "31m874": [], "31m878": [], "31m884": [], "31m890": [], "31m897": [], "31m9": [], "31m900": [], "31m904": [], "31m913": [], "31m918": [], "31m920": [], "31m921": [], "31m925": [], "31m937": [], "31m942": [], "31m947": [], "31m949": [], "31m95": [], "31m973": [], "31m978": [], "31m982": [], "31m995": [], "31merror": [], "32": [0, 4, 9, 11], "324": 0, "326": 0, "32767": 10, "32m0": [], "32m1": [], "32m10": [], "32m106": [], "32m11": [], "32m112": [], "32m12": [], "32m121": [], "32m122": [], "32m13": [], "32m14": [], "32m143": [], "32m149": [], "32m15": [], "32m16": [], "32m162": [], "32m163": [], "32m17": [], "32m174": [], 
"32m179": [], "32m18": [], "32m19": [], "32m2": [], "32m20": [], "32m207": [], "32m21": [], "32m214": [], "32m22": [], "32m23": [], "32m24": [], "32m25": [], "32m26": [], "32m266": [], "32m27": [], "32m28": [], "32m287": [], "32m29": [], "32m3": [], "32m30": [], "32m31": [], "32m317": [], "32m319": [], "32m32": [], "32m33": [], "32m333": [], "32m34": [], "32m35": [], "32m36": [], "32m368": [], "32m37": [], "32m38": [], "32m389": [], "32m39": [], "32m392": [], "32m399": [], "32m4": [], "32m40": [], "32m41": [], "32m42": [], "32m43": [], "32m434": [], "32m44": [], "32m45": [], "32m46": [], "32m47": [], "32m48": [], "32m481": [], "32m49": [], "32m5": [], "32m50": [], "32m51": [], "32m519": [], "32m52": [], "32m53": [], "32m54": [], "32m55": [], "32m56": [], "32m563": [], "32m59": [], "32m6": [], "32m60": [], "32m61": [], "32m614": [], "32m616": [], "32m63": [], "32m64": [], "32m7": [], "32m71": [], "32m727": [], "32m73": [], "32m76": [], "32m77": [], "32m774": [], "32m78": [], "32m8": [], "32m81": [], "32m87": [], "32m890": [], "32m899": [], "32m9": [], "32m90": [], "32m92": [], "32m94": [], "33": [], "331": 0, "333": [], "33437": 24, "33k": 5, "34": 0, "3479": 4, "34th": 0, "35": [], "3523": 24, "3572": 24, "35th": 0, "36": [], "360": [], "3643": [0, 5, 8, 9], "3655": [0, 5, 8, 9], "368": 0, "36m": [], "36m0": [], "37": [], "3727": 24, "375": 0, "38": [], "39": 4, "392": [], "39c7c0d87f8d4e6c020a393182060eaefeeae6c01dab6a84ec346f2567df": [], "3a": [], "3af39d34be01a24a6e65433d19e107099374224905f1e0cc6bbe1fd22a2f": [], "3b": [], "3b00ac340a1aab3389ebcc52c779914a44aadf7b0cb7a3bf053195735607": [], "3c": [], "3d": [], "3e": [], "3f": [], "3k": [4, 24], "3m": 10, "4": [0, 3, 5, 6, 7, 10, 15, 16, 18], "40": [], "41": 0, "42": [0, 16, 28], "43": [], "434": [], "435d5d7ec64d1c8b422ac9ebe42d2f3b2ac0b3f8a56f5c04dd0f3b7ba83c": [], "4361": 0, "4370": 0, "44": 11, "440": [], "4407": [0, 9], "44100": 10, "45": [4, 24], "4524": 4, 
"454d6e7f0158951d8a78c2e1eb4f69ae81beb8dca5fee9809c6c99e9d0d0": [], "456": 0, "4583": [0, 28], "4587": [0, 28], "46": [], "460": 0, "46649": 24, "467": [0, 17, 18, 27], "46th": [0, 18, 25], "47": [], "476": [0, 17, 18, 27], "48": [], "48072": 24, "48550": [0, 6, 8], "4868": 24, "49": [], "4b": [], "4c": [], "4c4672025c23a305231a81bf492f65aa3ea0965a89f9ca369a9ee7d47fd9": [], "4d": [], "4e": [], "4f": [4, 24], "4f639c1168d7aada749a896afb4892a831e2041bebdcf636aebfe9e86556": [], "4o": 19, "5": [0, 3, 4, 5, 9, 10, 12, 16, 18, 23, 24, 25, 28], "50": [4, 10, 27], "500": 10, "5063": 24, "51": 5, "519": [], "52": [], "521": [], "5244": 24, "525": [], "53": [0, 18], "5302": 24, "531": [0, 17], "534": [0, 17], "53k": 5, "54": [], "540": [], "541": [], "55": [], "5593a40fcd0981bda85274bb3e622ac433a94ae1e11ef8639de362cfa7d": [], "55bpm": 16, "55cdeed5889f2076fdb125bc87bb7ab0f1715c84b0a4619c44833d890f60": [], "56": [0, 28], "564beb0c78bf83018a146dfcdc959c99c10a0d136480b932a350c852adbc": [], "566": [], "57": [], "5730cc60bf438b56438756e45ac469c01bcf9c47d87632c468623167b7f": [], "5781": 4, "58": [], "580600f441f6fc05218bd6c9d5794f4aef072a7d9093b291f1c50a9db8bc": [], "58b70a580de00893223d61de8fea167877a3aed97d4a5e1405c9159ef925": [], "58d71f2041bc89919f56a69f8f2b9535a55d513bb005fbe4f8ee5d367170": [], "59": [], "591": [0, 28], "595": [0, 28], "5a": [], "5a36494314e4780362b15a7e190095eec68366a0d512b5b532607c213a26": [], "5af6804c4cc0fed83f47bff6e413a98a36618e7d40185cd36e69737f3b0": [], "5b": [], "5c": [], "5d": [], "5e": [], "5f30aea01532961bab043775258b06484f2a57530a88940e4cc3aea4f1f1": [], "5k": 5, "5min": 5, "6": [4, 5, 16, 24, 28], "60": [], "607": [], "608": 10, "609961972f694cb9520c4c3d201e377a26583e1eb83bc5a334c893729214": [], "60cd92bd3ec00948800984410f4cf5ded5bd8e9b715729f3642efe0edb3d": [], "61": [0, 23], "616": [], "61b627404c2d6f31dcbc491ff83da1f4336c7ae7893cfdc6c52db490ec59": [], "621": [], "6262": 4, "63": [], 
"6307fbef88d9b5ee7421e68d78a9f162e0da4900bc5f5793f6d3d0e34fb8": [], "64": 11, "6402242dde160d9ef9903487b4277443dc3da04615f6c4d3b48564a8ab57": [], "65": [], "66": [], "661": [], "6626": 0, "6637": 0, "67": [0, 18], "671c0e1f2572ba625cbcc1faeba9435e00330c3d6962858711445cf1e817": [], "6724805521ab4e723a12182f92374031032aff28a8a89dc8505c52b79032": [], "6742ef9206409d5ce1fdf44d5ca1687cdc3847ba0485424e2c731e6bcf67": [], "67ebd9d6ce9e65747e720c4c5614cd3a137e61340aec274657fcd9cc5162": [], "68": [], "6809": 24, "681": [], "684": [], "69": [], "693": [], "6980": [], "6a": [], "6b": [], "6d": [], "6e": [], "6e30b6b0cc0c18f8eb566e4f440e8127d9dad32bcaa70d38c8c44a21e62d": [], "6e9f9b41c48750a45ad07cc6d43a2979bfc09e6989656aece97cc59cbef1": [], "6f": [], "7": [4, 5, 10, 16, 24, 25], "70": [0, 18], "7047": 24, "71": [], "72": [], "72a58cb3b241d869811be4f9328a37f1563dc9c48af8c0467cb681f9ed46": [], "73": [], "74": [], "75718504a1bf0562e7e02def34cfc9bb274b6f284773cbeeeba0767a31b": [], "75a9c9421471a6c4805dbf2356f7c181a29c1879239abab1ea2cc8f38b40": [], "76": [], "7616": 0, "7633": 0, "768": 24, "77": 0, "774": [], "7762": [0, 8], "7770": [0, 8], "77cc11c7a9ea9fd05503def69e3d18605852cd0d4b0d3b8f15bbeb3ef1d1": [], "77edf4c29c8d6728b49d3f0abb22159bb9c0c4ddebd721c09486b34985c8": [], "78": [], "784": [0, 8, 9], "78bd0e95dd2444b6caacbca2b730671d4295ccb628ef58b81bee903629df": [], "7907": 24, "7925": 24, "7952585": [0, 9], "7b": 5, "7b5a1a5419e400f715387a48f65225ec7a3f2104465f346fc75e8793407b": [], "7c": [], "7dcce24e978bc14a18e2a3f7e2d6f4d2001533dc0cffab143bb3f8ec13d6": [], "7e": [], "7f": [], "8": [0, 4, 5, 8, 9, 16, 18, 24, 28], "80": 0, "800560": [0, 9], "80370da514096c6190f8913668198380ea09c2d252cfa4e85a9c096d3b40": [], "804": 0, "807": 0, "80cc3315dd1ca706643b78f894901d4d888ffe376a5e401f73d9db61071": [], "81": [], "8146aad7d88f4fcb3a6218f41a60f6c2d4e3a72de72da1825dc7c8f7877c": [], "81d47999aebc1b155f81eca4477a616a70f238a2549848c38983f3c22a82": [], "828": [], "83": [], 
"83871f3c50fc983b88547c196d11cf8c3340e37c32d2e9d6152abe2c61f7": [], "84": 0, "8462": 24, "85": [], "85249acbac630f34cd113dca4b1a72f55d3ad4c26bc9305a27aef6049756": [], "859": [0, 8], "86": [0, 9], "8630": 4, "8653ae6d18e20183fc6051fd2e10cd0c46e16a6b71eb34edef8d465dc969": [], "86bb218c7926e1da7a52e0696cab120a17c995933f08d8228d9aa83b44c5": [], "87": [], "8748": [0, 28], "8763": [0, 28], "88": [], "8821": [], "8831": [], "88k": 5, "89": [], "890a583cd3f2be27ecf32b479d5d615710bb926d92da03e3f7838ff3e58b": [], "899": [], "8a": [], "8b5d82fe2d9c7f260fb73121418f5e07d4e38c329ea3886a5b0e55586113": [], "8c": [], "8c75caed8f2462d63c7fd65e16c832b8f76cda331ac9e615e914ee80bac9": [], "8d": [], "8da8dd078b354a89602a875d310a0d725dad92b5b4d61069576e0a0e02e4": [], "8dd4d6de0fbba9d8f10d7b655be0578d5bda6e4db425210c265b0ea6c804": [], "8df4efa78df8b129847c8a7c0e492376cca62ab68453e5a20375a1c6291b": [], "8df927d3f0951cf67ca5973d89b35bcbda1777a4c78bf90a853d02d91285": [], "8e": [], "8f": [], "8f0c4a5bb9fd491c277c21eff7ccae71b47d43c4446c9d0c6cff2fe8c2c4": [], "8f8e631fcdc2ff978609eaeef1d6994bf2f028b59d9ac67640ed051f1218": [], "8k": 5, "9": [4, 5, 24, 28], "90": 4, "9048": 24, "9090": 4, "91": [], "917": [23, 25], "92": [], "9240": 24, "927e3a8899e52a27fa57a48607ff7dc91a9ebe97399b357b85a0c7892e00": [], "93": [], "9315": 4, "937": [0, 9], "9375917786cb39270b0ee6634536c0e22abf225825602688990d8f5c6c19": [], "9377bcb415797e44274b51d46e3249eba641711cf3348050f76ee7b15ffc": [], "93f7309eb40a9299c59a6637f13c21b08e585c569fee85901ccd55ce00f5": [], "94": [], "943": [], "94797cfe0263a30805f3074e535adfde02b885ac43d1e4dac85f82213b0b": [], "94c7dab8cfe7d41a23133634576fb89412e3430f28ca8d44411a77c2f18d": [], "95": [], "952": [0, 9], "953": [], "96": 11, "96142937f66150805c25c4d0f31ee4132fd33497753400734f9dfdcbdc66": [], "9748": 4, "98": [], "99": [], "9963d588cc3d75d766c819e0377a168ef83cf3316a92769971527a1ad1d": [], "9a": [], "9a683359ad2ed11b2303a7a94800db19c61d33fa3bde271df09e99936022": [], "9b": [], 
"9b2eab7833494e7c82f70c9b2f8e907d38231f4535704e3045a8a4960c8": [], "9c": [], "9cf1a409640adac045750b2ba9d1355c83942fbae74f21284c2133292be": [], "9eb14d4e9ef366be2020063d91c4f608294969fcd7b9fcc48153c64b9776": [], "9f1413bef53171f379d786aabc104d4abeea48ee84c553a3e3d8c9f96a9c": [], "9f1894efa1bb15e98613244b24dfbacfe2309e0ac3cfc27d4c608c2270d2": [], "9k": 5, "A": [0, 4, 5, 6, 8, 9, 11, 12, 17, 19, 21, 24, 26], "AND": 27, "AT": [], "And": [4, 11, 15], "As": [2, 3, 4, 6, 7, 8, 11, 14, 17, 19, 20, 22, 25, 27], "At": [8, 28], "BY": 5, "Being": [7, 20], "But": [4, 8, 19, 21, 24], "By": [13, 24, 26, 28], "For": [2, 4, 7, 8, 9, 11, 16, 19, 21, 22, 23, 24, 25, 27, 28], "If": [8, 9, 10, 11, 17, 19, 22], "In": [0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 21, 22, 23, 24, 25, 27, 28], "It": [0, 4, 8, 18, 19, 21, 22, 24, 26], "Its": [4, 19], "NOT": 27, "No": [4, 10, 16, 24], "OR": 27, "Of": 8, "On": 2, "One": [8, 9, 11, 12, 28], "Or": [0, 4], "That": 21, "The": [0, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 22, 23, 24, 25, 26, 27, 28], "Their": [25, 28], "Then": [11, 13], "There": [2, 11, 20, 21, 23], "These": [2, 4, 5, 6, 8, 11, 12, 13, 17, 19, 21, 23, 27, 28], "To": [2, 4, 10, 11, 12, 15, 19, 23, 28], "Will": [], "With": [4, 15, 19], "_": [8, 11, 22, 24, 28], "_0": [8, 11], "_1": 26, "_2": 26, "__getitem__": [4, 24], "__init__": [4, 24], "__len__": [4, 24], "_brownian": 10, "_c": [], "_end": 10, "_get_default_devic": [], "_i": 8, "_n": 26, "_q": 26, "_t": [8, 11], "a1": [], "a2": [], "a2t_project": 4, "a3": [], "a39c835871caca0173f526e321336a1a2b0961e38bf9b71b7213b651e3c8": [], "a4": [], "a5": [], "a6": [], "a61ef6f7faf98edadf4ce8094873d298f8582a3ec59b65c9174c516926e8": [], "a6c031bc1590789a3da14bd6a9cccc46c932401765d6d8f37e75c8214b44": [], "a7": [], "a8": [], "a812df4e2dd5696d1f351d58b8fe16a405b234ad2886a0dab9183fb78109": [], "a_i": 28, "aa": [], "aaa": [0, 18], "aaai": [0, 15], "aaron": [0, 17, 28], "ab": [0, 6, 8, 10], 
"ab44c871b0f07f491e5d2ad12c9bd7358e527510618cb1b803a88e986db1": [], "abbeel": 0, "aberman": [], "abhimanyu": [], "abhinav": [], "abhishek": 0, "abi3": [], "abil": [0, 8, 11, 12, 16, 17, 28], "abl": [7, 11, 12, 21, 23], "ablat": 12, "about": [0, 2, 3, 4, 7, 8, 9, 11, 16, 19, 21, 22, 23, 26, 28], "abov": [8, 11, 13, 14, 16, 19, 21], "abraham": [], "absent": 23, "absl": [], "absl_pi": [], "abstract": [0, 3, 6, 7, 8, 9, 22], "abu": [0, 8, 9], "academ": [3, 15], "acceler": [0, 15], "access": [3, 15], "accompani": [0, 2, 4, 5, 9, 13, 24], "account": [4, 10, 24], "accur": [2, 12, 13, 19, 28], "accuraci": [19, 26, 28], "achiam": [0, 18], "achiev": [3, 4, 8, 9, 13, 19], "aclanthologi": [0, 5, 8, 9], "acm": [0, 18, 25], "acoust": [0, 5, 8, 9, 16, 18, 23, 24, 28], "acquir": [], "across": [2, 13, 15, 19, 23, 25, 26, 28], "activ": [8, 11, 15, 27], "actual": [11, 16, 23, 26], "ad": [6, 11, 17, 19], "adaln": 11, "adam": [0, 5, 17, 18], "adamw": [4, 24], "adapt": [9, 11, 12, 13, 19, 28], "adb": [0, 5, 13, 18], "add": [4, 11, 24], "add_special_token": 4, "addit": [2, 8, 9, 11, 12, 15, 19, 21, 23], "addition": [2, 12, 13, 16, 21, 25, 27, 28], "address": [2, 3, 8, 9, 18, 19, 20, 23, 25, 27, 28], "adi": [0, 18], "aditya": [0, 28], "adjust": 19, "adler": [0, 18], "adob": 15, "adobephotoshopsenseiarteam": [], "adopt": [6, 8, 11], "advanc": [0, 3, 5, 7, 8, 12, 13, 15, 16, 17, 18, 20, 21, 23, 28], "advantag": [3, 9, 16, 17, 25, 28], "adversari": [0, 14, 17], "advis": 15, "ae": [], "ae30dadffc90b9006d77af76b393cb9dfbfc9629f339fc1574a1c52e6806": [], "aed7a284c00dfa7c0682d14df85ad4955a350a21d2e3b06d8240497359bf": [], "aeiou": [], "aesthet": [0, 9], "af": [], "af0d1f58f86002be0cf1e2665cdd6f7a4a71cdc8a7a9438cdc9e3b5375f": [], "affect": [10, 21], "after": [8, 10, 11, 19], "afternoon": 5, "again": 21, "against": [12, 26], "agarw": [0, 18, 28], "agent": [19, 20], "aggreg": [0, 6, 9], "aggress": 16, "agostinelli": [0, 5, 18], "agrawala": [], "ahead": 2, "ahm": [], "ahmad": [0, 18], "ai": [0, 8, 
10, 15, 16, 20, 22, 24], "ai4cc": [], "aidan": 0, "aiesha": [], "aila": [], "aim": [3, 13, 17], "aiobotocor": [], "aiofil": [], "aiohappyeyebal": [], "aiohttp": [], "aioitertool": [], "aiosign": [], "aittala": [], "ajai": 0, "ajit": [], "aka": 11, "akash": [], "akhgari": [], "akhil": [], "akkaya": [0, 18], "aksan": [], "akten": [], "al": [4, 8, 9, 24, 25], "alaluf": [], "alan": [], "alban": [], "albert": 0, "album": 5, "alcap": [0, 9], "alec": [0, 18, 28], "alejandro": 0, "alek": [], "aleksand": [], "aleman": [0, 18], "alex": [0, 17, 18], "alexand": [0, 27], "alexandr": [0, 18], "alexei": [], "algorithm": [0, 10, 13], "ali": [], "alia": [], "alias_free_torch": [], "align": [0, 2, 9, 13, 21, 23, 28], "all": [0, 2, 3, 8, 9, 11, 12, 15, 19, 21, 26, 27], "allow": [2, 8, 9, 11, 15, 16, 19, 24, 28], "allud": 19, "almeida": [0, 18], "almost": [11, 13, 21, 23], "alon": 0, "along": [5, 7, 11, 27], "alongsid": [6, 8, 9], "alpha": [], "alphabet": 20, "alreadi": [10, 21, 28], "also": [2, 3, 6, 7, 8, 9, 11, 12, 13, 15, 16, 19, 20, 21, 23, 25, 28], "altenschmidt": [0, 18], "altern": [5, 8, 17, 18, 19, 21], "although": [8, 16, 19, 21], "altman": [0, 18], "alwai": [6, 8, 19], "amanda": [0, 28], "amaz": 24, "amazon": 5, "ambient": 28, "ambuj": [], "america": [], "american": [0, 23], "ami": [], "amir": [], "amirmojtaba": [], "amit": [0, 5, 9], "amodei": [0, 18], "among": [5, 6, 8], "amount": 19, "amp": 10, "amplitud": 19, "amu": [7, 9], "an": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28], "anaconda3": 10, "anadkat": [0, 18], "analogi": [0, 5], "analys": 7, "analysi": [0, 23, 25, 28], "analyt": 23, "analyz": [3, 26, 27], "anandkumar": [], "anchor": 28, "and82": [0, 11], "anderson": 0, "anderson2016": [], "andi": [], "andr": 0, "andrea": [0, 5, 18], "andrew": [0, 17, 18], "andrii": [], "angela": [], "angelo": 0, "anger": 27, "ani": [2, 3, 4, 10, 11, 19, 21, 22], "anil": [], "anima": [], "animesh": [], "anirudh": [], "anjali": [], 
"ann": 15, "anna": 0, "annot": [0, 5, 7, 16, 18, 23, 27, 28], "annotated_typ": [], "anoth": [4, 8, 19, 21], "ansel": [], "answer": [0, 3, 5, 6, 8, 16, 19, 22], "anthem": 24, "anticipatori": [0, 13], "antoin": [0, 5, 18], "antonio": [], "anygpt": 8, "anyi": [], "anyio": [], "anyon": 4, "anyth": [2, 11, 19, 20, 21, 22], "anytorch": [], "aouameur": 0, "ap": 26, "apach": 5, "apart": 28, "api": [0, 10, 19], "appdir": [], "appear": [8, 21, 22], "append": [11, 24], "appl": 15, "appli": [5, 12, 13, 15, 19, 21], "applic": [0, 3, 7, 9, 15, 17, 18, 19, 20, 21, 22, 27], "appreci": 4, "approach": [0, 2, 3, 6, 8, 9, 11, 13, 15, 17, 19, 21, 22, 23, 24, 25, 27, 28], "appropri": [12, 17, 25, 26, 27], "approx": 11, "approxim": 11, "ar": [0, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28], "arab": [0, 8, 9], "arang": 24, "arash": [], "arbitrari": [21, 28], "arbor": 15, "architectur": [2, 3, 4, 7, 12, 13, 14, 16, 21, 22], "area": [12, 13, 15, 17, 19, 22, 25], "aren": [0, 5, 18, 19, 28], "argbind": [], "argpars": [], "aris": 3, "armi": [0, 6], "around": [4, 8, 10], "arrai": [4, 24], "arrang": 28, "arriv": 19, "art": [0, 2, 8, 9, 13], "arthur": 0, "articl": 19, "articul": 13, "artifici": [0, 8, 15], "artist": [2, 13, 17, 23, 27, 28], "artsiom": [], "arun": [0, 18, 25], "arxiv": [0, 5, 6, 8, 9, 17, 18, 23, 25, 28], "ashish": 0, "ask": 19, "askel": [0, 28], "aspect": [8, 9, 12, 23, 28], "assess": [0, 6, 12, 26, 27], "assign": [9, 12, 13, 21, 28], "assist": 19, "associ": [0, 5, 8, 9, 18, 27, 28], "assum": 8, "ast": 28, "asttoken": [], "atin": [0, 5, 8], "attempt": [8, 17, 23, 27], "attend": [19, 28], "attent": [0, 2, 4, 8, 11, 14, 19, 28], "attention_mask": [4, 24], "attr": [], "attribut": [16, 17, 18, 23, 27], "atzmon": [], "audio": [0, 2, 3, 4, 5, 6, 7, 8, 9, 11, 13, 17, 18, 19, 20, 23, 24, 27], "audio_2023": 8, "audio_base64": 5, "audio_byt": 5, "audio_embedding_dim": [4, 24], "audio_forward": 24, "audio_html": 5, "audio_project": 24, 
"audio_sampl": 10, "audiogen": [0, 13], "audioldm": [0, 13], "audiolm": [], "audioread": [], "audioset": 5, "audiotool": [], "audit": [], "auditori": 18, "augment": [0, 5, 8, 9, 11], "august": [0, 6], "auraloss": [], "authent": 10, "author": 9, "auto": [0, 9, 14, 18], "autocast_mod": 10, "autoencod": [0, 11, 17, 19], "autom": [17, 20], "automat": [0, 7, 9, 15, 17], "automodel": [4, 24], "autonom": 20, "autoregress": [11, 13, 19, 22, 28], "autoregresst": 13, "autosav": 10, "autotoken": 24, "auxiliari": 5, "av": [], "avaiabl": 10, "avail": [4, 10, 17, 19, 24], "avent": 0, "avenu": 2, "averag": [6, 12, 25, 26], "avoid": 10, "awai": [5, 6, 11, 24], "awar": 9, "ax": [], "axel": [], "axi": [7, 11, 16], "ayan": [], "ayh": [], "azalea": [], "b": [0, 10, 19], "b1": [], "b161908e2f51be56568184aeb4a880fd287178d176fd1c860d2217f41106": [], "b2": [], "b3": [], "b4": [], "b6": [], "b64encod": 5, "b67ebd7e19ffe259f05d3cf4547326725c3113d640c277030be3e9998d6f": [], "b7": [], "b8": [], "b86984bed139586d01532a587464b5805f12e397594f19f931c4c2fbfa61": [], "b9": [], "b95df0b8593aee5d9e68b9a9f24e83c69657afb46b24f83b57098d926401": [], "b9b800c45527aadd64d5b442f9b932b00648617eb5d63d2c7a6587b7cafc": [], "ba": [], "ba44652d562cbf0bf320e0f3810206149c8a4e99cdbf66da82e97ab53a15": [], "bach": [0, 13, 17, 18], "back": [10, 11, 13, 19, 20, 23, 27], "backbon": [8, 11], "background": 23, "backpropag": [], "backward": [4, 24], "bad": 2, "bahjat": [], "bahri": [], "bai": [], "baid": [], "balaji": [], "balanc": [8, 26], "balog": [0, 18, 25], "banjo": 24, "bao": [], "bar": [4, 11], "barn": [], "barret": [0, 18], "barrett": [], "barrington": [0, 17, 18, 27], "barron": [], "bart": 8, "barzilai": [], "base": [0, 3, 5, 7, 8, 9, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28], "base64": 5, "baselin": [9, 19, 28], "bash": 10, "basi": 5, "basic": [4, 17, 18, 19, 21, 22, 24], "bass": 28, "batch": [4, 10, 24, 28], "batch_siz": [4, 24], "bay": [], 
"bb9ff095ae7b1b6908480f683b6ca6b71c2105d343a5e5cb25334b01f5fa": [], "bc": [], "bd": [], "bdt": [], "be958fefa589186b54daaa9a72fa1a2e19e42a2dcab87ee15c8273259da0": [], "beach": 28, "beat": [0, 5, 18, 24], "beatl": 28, "beauti": 4, "becaus": [11, 12, 16, 17, 19, 20, 21, 22, 24, 26], "becom": [2, 8, 9, 13, 16, 19, 20, 21, 27], "beeler": [], "been": [2, 9, 10, 11, 12, 13, 15, 19, 21, 22, 23, 27], "befor": [10, 11, 12, 19, 27], "began": 13, "begin": [18, 21, 24, 28], "behav": 19, "behavior": [4, 19, 23, 24], "behind": [12, 19], "being": [3, 9, 11, 16, 19, 21, 23, 26, 28], "believ": 15, "bell": [], "below": [5, 6, 8, 10, 11, 19, 21, 26, 28], "belt": 4, "ben": 0, "benchmark": [0, 6], "benefit": [7, 9, 15], "beneto": [0, 5, 6, 8, 9, 15, 18, 23, 28], "bengio": [0, 17], "benjamin": [0, 8], "benno": [0, 5, 6, 23, 28], "benzi": [], "berard": [], "berg": [0, 8, 15, 18, 28], "bergman": [], "bermano": [], "bernard": [], "bernhard": 0, "bert": [0, 6, 11, 18, 21, 22, 24, 28], "bertin": [0, 17], "bespok": 2, "best": [2, 3, 11, 15, 19, 27, 28], "beta": [], "beta_": 8, "bethard": [0, 5, 8, 9], "better": [4, 6, 8, 9, 11, 19, 20, 22, 23, 24, 25, 26, 28], "between": [3, 6, 7, 8, 9, 12, 13, 15, 17, 19, 20, 23, 25, 26, 27, 28], "beyond": [0, 9, 17, 18, 19, 25, 26], "bhe23": [], "bi": 28, "bia": [20, 24], "bian": [], "bias": [12, 20], "bichen": [], "bidirect": [0, 11, 18, 28], "big": [11, 19], "bigger": [19, 20], "biggest": 20, "bigvgan": [], "bilei": [], "billion": 19, "bin": [], "binari": [26, 27], "bing": [], "bingchen": [], "biomed": [], "bit": [8, 10, 11], "bittner": [0, 8, 9, 18], "bj": [], "bjd": [], "black": [], "blank": [19, 21, 22], "blap": [0, 8], "blattmann": [], "bleach": [], "blend": [4, 13, 24], "bleu": 6, "bleu_1": 6, "blob": 10, "block": [8, 10, 11, 14], "blocker": 20, "blog": [0, 11, 18], "blown": 24, "blue": [5, 13, 16], "blurri": 19, "bmv": [], "bnh": [], "bo": [], "bockkschlut": [], "bockkw16": [], "bodganov": [0, 5, 23], "bodi": 2, "boesel": [], "bogdanov": [0, 6], 
"bohan": [], "boissier": [], "bokeh": [], "boldsymbol": [8, 11], "bolei": [], "book": [3, 4, 15], "booktitl": [], "boolean": [18, 27], "boost": 2, "bootstrap": [0, 8], "borgeaud": [], "bori": 0, "borrow": 6, "borso": [0, 5, 18], "bos_embed": 4, "bos_token_id": 4, "bosma": [0, 18], "bot": 20, "both": [2, 4, 8, 9, 11, 12, 19, 25, 26, 27, 28], "botocor": [], "bottleneck": [13, 14], "bottom": 21, "boyer": [0, 9], "bpe": [21, 28], "bpm": 10, "braceexpand": [], "brahma": [0, 18], "bram": [], "brandon": [0, 9], "brass": 5, "braun": [], "break": [0, 2, 4, 8, 11, 24, 28], "breakthrough": 13, "breathtak": 4, "brebisson": [], "bresson": [], "breviti": 6, "brian": [0, 17, 18, 27], "bridg": [0, 3, 5, 8, 9, 13, 18, 27, 28], "briefli": [6, 16, 19], "bright": 16, "bring": 19, "broad": [2, 11, 13, 22], "broadcast": 21, "broader": [23, 27], "brockman": [], "broken": 8, "brook": [], "broomel": [], "brownian_interv": 10, "brows": 27, "browser": [10, 24], "brox": [], "brualla": [], "bruno": 0, "bryan": [0, 5, 18], "bsv": [], "btyld23": [], "budget": [8, 19], "build": [2, 11, 15, 23, 25, 28], "built": [11, 24, 27, 28], "bulid": 24, "bunch": 19, "burcu": [], "burgeon": 15, "burovski": [], "byte": [5, 21, 28], "bytecod": [], "byted": 15, "c": [0, 6, 8, 11, 16, 17, 19], "c1": [], "c13ea695a4393639830bf96baea956538ba7a9d06fcce7cef10bfff20f72": [], "c188ac517f402775b90d6f312955a5e53b866c964b32119f2ed76315697": [], "c19819d5e3d95294a6f5947fb9b9629efb316b96de511b418c53d245aae6": [], "c2": [], "c316262244abea7481f95f1e91d7575f3dfcf6455d56d1ffe9839c582eb1": [], "c4": [], "c463dc5fc02fbe019566d067a9d18746cd3c664f29c9b8b3c3f9ed025365": [], "c4dm": [], "c5": [], "c6": [], "c691e6c5d925a364d63eec27d1f10477ca7902febe10a8e1f86284dba754": [], "c869a1fbd481dcb02c70032fd6a7243de7582bc48c7cae03d6f0985a11c0": [], "c8bfa8cbcd3ea1d25d2beb359b5c5a3f4339a7e2e5d9e3ef3e29ba3ab3b9": [], "c9b96572ab7994e73c64588f8875741823f2daba70e746547fff9a2d9a54": [], "ca": 15, "cacer": 0, "cach": [], "cacul": 12, "cai": [], 
"caillon": [0, 5, 18], "calcul": [12, 19, 21, 26], "california": [15, 28], "call": [4, 8, 10, 11, 16, 21, 22, 24, 25], "cambridg": [], "came": 28, "campaign": 20, "can": [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 16, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28], "cancel": [], "candid": [6, 15], "cangea": [], "cannot": [9, 11, 17, 23, 27, 28], "cao": 0, "cap": [4, 24], "capabl": [3, 8, 13, 15, 17, 19, 23, 25, 28], "capac": 4, "capit": 11, "caption": [0, 2, 4, 5, 6, 7, 8, 11, 15, 16, 18, 24, 28], "caption2emb": 24, "captiv": [4, 24], "captur": [4, 6, 7, 8, 9, 17, 22, 23, 24, 25, 28], "carbonneau": 0, "care": [8, 23, 25, 26], "carefulli": [20, 26, 28], "carlo": [], "carnovalini": [], "carol": [0, 5, 9], "carr": 0, "carri": 8, "carrol": [0, 18], "casagrand": [], "cascad": 17, "case": [2, 5, 7, 8, 9, 10, 11, 12, 21, 23], "caseb": 0, "casei": [], "cast": 8, "casual": 5, "cat": [4, 16, 24], "catalog": 27, "catanzaro": [], "catchi": 24, "categor": [9, 21], "categori": [2, 9, 10, 11, 26, 28], "cater": 15, "caus": 11, "causal": [21, 28], "cb": [], "cc": [4, 5, 24], "cc3a402a6439c15c3d4294333e13042b915bbeab54edc457c723931fed3f": [], "ccf007edf442c3c0cd3a98be2c82bc99edc957c04436a759b6e1e01077e0": [], "cck": [], "cd": 15, "cd10c82398f3b39bbf60a300e09c931bdf6844f3f2fba9ab2b5981501f9f": [], "cdescrivan17": [], "cdot": [11, 28], "cdz": [], "ce": 0, "ce21": [0, 2], "ce6964e9f8822f6e63ebc59bdcc5ae445126b7356da63188fa0e6265054": [], "cell": [4, 10, 16, 24], "celma": [0, 17], "celso": [], "cem": [], "center": 4, "centr": 15, "central": [19, 26], "certain": [19, 21, 22], "certainli": 11, "certifi": [], "cf": [], "cffi": [], "cfg": 10, "cfg_scale": 10, "cfs16": [0, 8], "cfsc17": [0, 9], "chaganti": [0, 18, 25], "chakrabarti": [], "challeng": [0, 3, 6, 7, 13, 15, 17, 18, 19, 22, 27, 28], "cham": [], "chan": [], "chanan": [], "chang": [0, 2, 18, 19, 22, 28], "changli": [0, 8], "changx": [], "changyou": [0, 9], "channel": [2, 11, 17, 19], "chao": [0, 8], "chaowei": [], "chaoyu": [], "chapter": [3, 
16, 18, 22, 23, 25], "character": [], "characterist": [4, 6, 8, 9, 23, 27, 28], "charli": [], "charset": [], "chartmetr": 15, "chat": [22, 25], "chatgpt": [16, 19, 22], "chatthe": [], "chaudhuri": [], "chauhan": [], "che23": [], "cheaper": 21, "cheapli": 19, "cheat": 12, "chechik": [], "check": [4, 10, 11, 16, 24], "chelsea": [], "chemistri": [], "chen": [0, 5, 6, 8, 9, 15, 18, 23, 28], "cheng": 0, "chenji": 0, "chenlin": [], "chenshuo": [0, 5, 8], "chenyang": [], "chet": 0, "cheung": 0, "chia": 0, "chiang": 0, "chieh": 0, "child": [0, 18], "chinchilla": 19, "ching": [], "chintala": [], "chitwan": [], "chiu": [], "chiyuan": [], "chl": [0, 18], "cho": [0, 9], "cho_unifying_2021": [], "choi": [0, 8, 9, 17, 18, 23, 25, 28], "choic": [4, 6, 8, 19, 24, 28], "chong": [0, 18], "chongxuan": [], "choos": 8, "choppi": 5, "choral": [0, 13], "chord": [2, 5], "choru": 2, "chosen": 9, "chou": [0, 17, 18], "chourdia": [], "chri": [0, 17, 18, 28], "christian": [], "christin": [0, 18], "christina": [], "christoph": 0, "chronolog": 11, "chu": [], "chu_qwen": 8, "chul": [], "chun": [], "chung": [0, 18], "cider": 6, "cinjon": [0, 17], "circul": 2, "circumv": 11, "cite": [], "citep": [], "cites": [], "citi": [0, 5, 8, 9], "cj": 0, "ck": [], "ckg": [0, 13, 14, 18], "ckm": [], "ckp": [], "clamp": [10, 24], "clap": [11, 12], "clariti": 11, "clark": [0, 28], "class": [2, 16, 21], "classic": [4, 5, 11, 16, 24, 27], "classif": [0, 3, 7, 12, 17, 18, 19, 27, 28], "classifi": [9, 10, 27], "claud": 19, "clean": [10, 11], "clean_fid": [], "clean_up_tokenization_spac": [4, 24], "cleaner": 11, "cleanli": 11, "clear": [6, 8, 9, 11], "clever": [11, 19], "clich\u00e9": 13, "click": [], "client": [], "clip": [4, 9, 10, 15, 16, 19], "clip_anytorch": [], "clone": [4, 15], "close": [8, 13, 19, 28], "closer": [5, 28], "closest": 17, "clpn19": [0, 18, 28], "cluster": [], "clz": [0, 18, 25], "cn24": [], "cnn": 8, "co": [0, 8, 10, 12, 15, 28], "coars": [2, 9], "coca": [0, 16], "code": [0, 3, 11, 15, 18, 19, 
28], "codec": [11, 12, 18], "coeffici": 11, "cohen": [], "coher": 25, "col10": [], "col12": [], "colab": [4, 24], "colin": [0, 18], "collab": 10, "collabor": 28, "collect": [0, 7, 17, 19, 20, 27], "colleg": 15, "collin": [], "colloqui": 23, "color": 16, "colorcet": [], "com": [0, 4, 10, 15, 24], "combin": [3, 6, 8, 11, 12, 14, 15, 24, 27, 28], "come": [6, 8, 9, 11, 19, 20], "comm": [], "command": 10, "common": [4, 6, 7, 8, 19, 21, 28], "commonli": [6, 12], "commun": [2, 15, 16, 17, 28], "compani": 19, "companion": [], "compar": [6, 7, 12, 13, 16, 17, 18, 19, 21, 25, 26, 27, 28], "comparison": [6, 19, 26], "compat": [], "compil": [], "complement": 12, "complementari": 26, "complet": [3, 11, 15, 19, 23, 24, 26, 28], "complex": [3, 7, 9, 11, 12, 13, 16, 17, 18, 19, 20, 21, 27, 28], "compon": [3, 4, 6, 8, 18, 19, 21, 25, 28], "compos": [2, 8, 13, 25], "composit": [0, 13, 15], "comprehend": [8, 15], "comprehens": [3, 12, 15, 26, 27, 28], "compress": [0, 11], "compris": [], "comput": [0, 5, 6, 8, 9, 11, 12, 13, 15, 17, 18, 19, 21, 26, 27, 28], "computation": [2, 21], "concaten": [2, 8, 11, 14], "concept": [4, 9, 11, 13, 16, 18, 19, 24, 26, 28], "conceptu": 11, "concern": [11, 19], "conclud": [3, 11], "conda": [10, 15], "condens": 11, "condit": [0, 2, 3, 4, 9, 10, 14, 17, 18, 19, 22], "conduct": [3, 23], "confer": [0, 5, 6, 8, 9, 15, 17, 18, 25, 28], "confid": 12, "config": [], "configpars": [], "configur": [], "cong": [0, 6], "congratul": [3, 24], "connect": [8, 17, 21, 24, 28], "connectionist": 0, "connelli": [], "consecut": 22, "consensu": 6, "consequ": [8, 28], "consid": [4, 6, 9, 12, 20, 22, 23, 26, 28], "consider": [2, 26], "consist": [0, 4, 5, 8, 13, 14, 15, 17, 28], "consistut": [], "consolid": 8, "constabl": [], "constant": [5, 19], "constantin": [0, 8], "constitu": 28, "constrain": [9, 28], "constraint": 13, "construct": [4, 28], "consum": 12, "consumpt": 20, "contain": [2, 5, 9, 15, 23, 25, 27], "contemporari": [4, 24], "content": [0, 2, 3, 4, 5, 7, 9, 15, 17, 
23, 24, 27, 28], "context": [6, 8, 11, 20, 21, 23, 25, 28], "contextu": [6, 16, 19, 21, 22, 23, 25, 28], "contigu": 4, "continu": [3, 6, 12, 14, 19, 21, 22], "contourpi": [], "contrast": [0, 3, 12, 16, 18, 19, 24, 28], "contribut": [13, 15], "contributor": [], "control": [0, 5, 11, 13, 15, 17, 18, 19, 20, 22, 28], "controlnet": [0, 2, 18], "convei": [4, 9, 13], "conveni": 6, "convent": [15, 25], "converg": 0, "convers": [0, 3, 8, 12, 15, 18, 22, 23], "convert": [10, 11, 14, 18, 28], "convert_tokens_to_id": 4, "convolut": [0, 9, 11, 14], "cooijman": [], "cook": [0, 9], "copet": [0, 18], "copi": 4, "copyright": 5, "core": [5, 8, 11, 28], "corner": 12, "corpora": [16, 28], "corpu": [0, 5, 21, 22, 23], "corpusid": 0, "corr": 0, "correct": [4, 11, 19], "correctli": [19, 26], "correl": [2, 6], "correspond": [9, 11, 12, 19, 21], "corrupt": [11, 28], "cosin": [6, 12, 26], "cosmo": 0, "cost": [11, 12, 19, 20, 27], "costli": [], "cot": 19, "could": [11, 17, 21, 23, 25, 27, 28], "couldn": [], "count": [8, 22, 26], "countri": 24, "coupl": 7, "cours": [8, 19], "courvil": [0, 17], "cover": [7, 9, 15, 16, 17, 21, 22, 23, 27], "coverag": [23, 27], "cp26": [], "cp27": [], "cp311": [], "cp32": [], "cp33": [], "cp34": [], "cp35": [], "cp36": [], "cp37": [], "cpcd": 25, "cpjku": [], "cpp": [], "cpu": [4, 10, 24], "cqt": [], "crawl": 5, "creat": [2, 3, 5, 6, 7, 12, 15, 17, 19, 23, 25, 26, 27, 28], "create_audio_html": 5, "creation": [0, 13, 15, 18], "creativ": [0, 2, 8, 13, 17], "cref": [], "criteria": [12, 23, 26, 27], "criterion": 4, "critic": [20, 26, 27, 28], "crop": 2, "cross": [2, 8, 11, 14, 19, 28], "cross_entropi": [4, 24], "crossentropyloss": 4, "crowdsourc": 5, "crucial": [9, 12, 26, 28], "csl": 13, "csrc": [], "cuda": [4, 10, 24], "cue": [13, 18], "cultur": [20, 23, 28], "cun": [], "curat": [0, 18, 25], "current": [2, 3, 6, 8, 9, 10, 13, 18, 21, 23, 25], "curti": 0, "curv": 11, "custom": [2, 5, 19, 24], "cut": 18, "cutoff": [20, 26], "cvf": [0, 5], "cvpr": [0, 5, 15], 
"cvpr52688": 0, "cvpr52729": [0, 5], "cvsf23": [], "cwbergkirkpatrickd20": [0, 13], "cwbkd20": [], "cwl": [0, 11, 13, 18], "cxh": [], "cxz": [0, 28], "cxzg16": [], "cyclegan": [], "cycler": [], "cyran": 0, "cyril": [0, 17], "czj": [0, 13], "d": [0, 4, 10, 11, 15, 18, 19, 24], "d1": [], "d110f0a43beb365758a252203c43eaaad169fe7749da918869a8c991f726": [], "d1e337b9b4c8ea3aae5d399ace8c9cf4c2a7789cfe9d14766511fbc83c8b": [], "d2": [], "d23a97e0a2c690d40b165d1062e2c4ccc796be458a1ce59f6ba030434663": [], "d2805324fb746d8da86d3844bee4f55c0cfd6c136de61b713772d44c5bea": [], "d3": [], "d4": [], "d497a310bde3f01cb805196ac61b7ad6dc5dcf8dce66634dc34364b20b4f": [], "d5": [], "d78dc063216e62fc55f6b2eebb447f6a4b0a59f55c8406376f76bf959b08": [], "d8": [], "d9": [], "d_": 8, "d_c": 11, "d_h": 11, "d_k": [], "d_t": 11, "d_w": 11, "da": [], "dabeaf902892922777492e1d253bb7e1264cadce3cea932f7ff599e53fea": [], "dac": 11, "dacheng": [], "daeyong": [0, 23, 25], "dahl": [], "dai": [0, 5, 9, 18, 28], "daiq": [], "dall": 19, "damien": [], "dan": [], "danc": [2, 5, 16], "danceabl": 5, "dang": [], "daniel": [0, 5, 8, 9, 18, 28], "danilo": 0, "dannenberg": [], "dao": [], "dao23": [], "dara": [], "dario": [0, 18], "dark": 16, "dasaem": [0, 28], "data": [0, 3, 5, 7, 8, 9, 11, 12, 13, 16, 17, 19, 20, 21, 23, 25, 27], "databas": [17, 19, 24, 26, 27], "datafram": 5, "dataload": [4, 24], "dataset": [0, 2, 6, 7, 8, 9, 15, 16, 17, 18, 19, 23, 25, 27, 28], "date": [13, 19, 23], "dateutil": [], "daunt": 20, "davi": [], "david": [0, 17, 18, 27], "dawen": [], "dazhong": [], "db": 24, "db99aa669eee301966bc6c997d60a0240f9cecae63f044b2e5a5310e4bf7": [], "dbvb17": [], "dc39062efec7515add304b98a54da2948709a808": [], "dcd": [0, 13], "dck": [0, 23, 25], "dcln23": [0, 8, 18], "dclt18": [0, 18, 28], "dcr": [0, 2], "dcsa22": [0, 11], "dctorch": [], "dd": [], "ddp09": [], "ddpm": [0, 2], "ddsp": [0, 13], "de": 11, "de3276d773ab6ce3ad676df5fab5aac19696b2956319d65d7dd88fb10f19": [], "deadlock": 10, "deaf": 7, "deal": [8, 9, 
11, 21, 27], "decemb": [0, 8, 9], "decid": 19, "decis": 19, "decod": [3, 4, 5, 11, 14, 17, 18, 24], "decompos": 21, "deconvolut": 14, "decor": [], "decreas": 19, "dedic": [10, 11], "deep": [0, 4, 7, 8, 9, 12, 13, 15, 17, 18, 22, 24, 28], "deepak": [], "deepanwai": [0, 5], "deepbach": [0, 13], "deeper": [11, 15, 19, 25], "deepfak": 20, "deepli": 17, "deepmind": 15, "def": [4, 5, 24], "default": [4, 10, 24], "defferrard": [], "defin": [2, 4, 8, 10, 11, 18, 21, 22, 24, 26], "definit": [8, 18, 22], "defossezcsa23": [0, 14], "degara": [], "degre": [], "dehghani": [0, 18], "dekel": [], "delet": 10, "delic": 4, "delight": 3, "deliv": [4, 24], "delta": 28, "delv": [15, 18], "demo": [0, 8, 10], "demonstr": [3, 14, 16, 19, 25, 28], "den": [0, 17, 28], "deng": [0, 5, 8, 9], "dengsheng": [0, 17], "denk": [0, 5, 18], "denois": 0, "denot": [8, 9, 11, 22, 28], "dens": 28, "densiti": [2, 11], "denton": [], "depart": 15, "departur": 15, "depend": [6, 8, 9, 10, 11, 12, 15, 19, 21, 22, 28], "deploi": [], "depract": [4, 24], "depth": [3, 18], "deriv": [5, 11, 12, 15], "desc": [4, 24], "descent": 22, "describ": [0, 2, 4, 5, 7, 8, 9, 16, 23, 27], "descript": [0, 3, 5, 6, 13, 15, 16, 19, 23, 24, 27, 28], "descript_audio_codec": [], "descript_audiotool": [], "description_evalu": [], "description_model": [], "description_models_t": [], "description_task": [], "descriptor": 9, "deserv": [], "deshmukh": [0, 8], "desideatum": 12, "design": [2, 4, 6, 7, 8, 9, 10, 11, 12, 23, 28], "desir": [17, 19], "desktop": 19, "desmaison": [], "despit": [17, 21], "dessert": 3, "desw23": [0, 8], "detach": [4, 24], "detail": [2, 4, 6, 8, 9, 12, 13, 15, 16, 18, 21, 22, 24, 28], "detect": [20, 27], "determin": [11, 19, 26], "develop": [0, 7, 8, 9, 12, 13, 15, 16, 17, 18, 19, 22, 25, 27, 28], "devi": 0, "devic": [4, 10, 24], "device_typ": 10, "devin": [], "devis": 8, "devito": [], "devlin": [0, 18, 28], "df": 5, "df18d492a8f00d29a30db307904b9b296e37507034eedb523876f3a2e13": [], 
"df4b9b42f2be0b623cbd5e2140cafcaa2bef0759a00b7b70104dcfe2fb51": [], "df630c387a0a054815d60be6a97eb4e8f17385d5d6fe660e1c02750062b4": [], "dhabi": [0, 8, 9], "dhariw": [0, 18], "dhyy18": [0, 13], "di": 0, "dialog": 18, "dialogu": [0, 6, 7, 9, 15, 23, 25], "dickstein": 0, "dict": [], "did": [8, 21], "diederik": 0, "diego": 15, "dieleman": [0, 11, 17], "diff": [0, 2], "differ": [4, 5, 6, 7, 8, 9, 11, 12, 14, 17, 23, 24, 26, 27, 28], "differenti": [0, 2, 8, 11], "differnt": [], "difficult": 19, "difficulti": [16, 19], "diffus": [0, 2, 3, 10, 13, 16, 18, 19], "diffwav": [], "dig": 22, "digit": 0, "dim": [4, 24], "dimens": [11, 21, 23], "dimension": [11, 28], "dimitra": [], "dinculescu": 0, "ding": [], "dinh": [], "diogo": [0, 18], "direct": [3, 7, 8, 11, 15, 19, 23, 28], "directli": [2, 5, 8, 10, 11, 13, 19], "disabl": 10, "disadvantag": 3, "discard": 23, "discount": 6, "discov": [23, 25, 27], "discoveri": [0, 15, 23, 25], "discret": [2, 3, 11, 14, 18, 19, 21], "discrimin": [11, 14, 17, 19, 28], "discuss": [2, 3, 7, 8, 9, 11, 12, 15, 18, 19, 20, 25, 28], "dispatch": 19, "displai": [4, 5, 10, 24], "dispos": 8, "dissimilar": 4, "dist": [4, 24], "distanc": [0, 24, 26, 28], "distil": 0, "distinct": [12, 13, 27, 28], "distinguish": [0, 4, 7, 8, 17, 18, 28], "distribut": [2, 5, 8, 11, 12, 17, 19, 21, 22], "dit": 11, "ditto": [0, 2, 15, 18], "div": 10, "diverg": 12, "divers": [0, 15, 18, 23, 25], "dixon": 0, "djgd21": [], "djp": [0, 13, 18], "dkb14": [], "dl": 8, "dljn24": [0, 28], "dmitri": [0, 5, 6, 23], "dmitrii": [], "dml": [0, 5, 8, 9], "dmp18": [], "dmp19": [0, 17], "dn21": [0, 11], "do": [0, 2, 8, 11, 15, 19, 21, 22], "do_sampl": 4, "doc": 10, "docker": [], "docker_pycr": [], "dockhorn": [], "docnam": [], "docstr": [], "docstring_pars": [], "doctor": 15, "document": [0, 4, 9, 24], "doe": [6, 9, 10, 11, 12, 19, 21, 24], "doesn": [2, 11, 19, 23], "doh": [0, 5, 8, 15, 18, 23, 25, 28], "doi": [0, 5, 6, 8, 9], "domain": [0, 2, 3, 4, 8, 11, 12, 13, 14, 15, 16, 17, 21, 28], 
"domin": 4, "dominik": 0, "don": [3, 8, 10, 19, 21, 23, 27], "donahu": [0, 17, 18], "donald": [], "done": [6, 10, 11], "dong": [0, 5, 9], "dongchao": [], "dongdong": [], "dongjun": 0, "dorien": [0, 5], "doshi": [], "dot": [9, 12, 28], "dougla": [0, 9, 17, 18, 27], "down": [11, 28], "downbeat": [], "download": [10, 24], "downsampl": 11, "downstream": [15, 16], "dpm": [], "dpmpp": 10, "dpo": 19, "dramat": [25, 28], "draw": 12, "drawback": 9, "dreambooth": [], "dreamfus": [], "drift": 11, "drive": 13, "driven": [0, 8, 17, 18], "drop": [0, 28], "drop_last": [4, 24], "dropout": 28, "drum": 2, "dsdb16": [], "dtype": [4, 24], "du": [0, 18], "duan": 0, "dubei": [], "dubnov": [0, 8, 9, 18, 28], "duc": 0, "due": [13, 20, 21, 25], "duet": 24, "duh": [0, 5, 8, 9], "dumoulin": [], "dung": [], "durand": [0, 8, 9, 18], "durat": [10, 13], "dure": [12, 15, 18, 19, 27, 28], "dvdos18": [], "dwcn23": [0, 16, 18, 28], "dylan": [], "dynabert": [], "dynam": [0, 13, 19, 24, 28], "d\u00e9fossez": 0, "e": [0, 2, 4, 5, 8, 9, 11, 16, 19, 21, 22, 25, 26, 27, 28], "e0": [], "e07ce413d16ef64e885bea37551eac4c5ca3ddd440933f9c94594273d0d9": [], "e0d3c824784ff121c03cc031f944bc7e139a8f1870ffd2845cc2dd76f6c4": [], "e1127810de8b60a58bfa682f858fd7ba36667d29c0b9ad3b6ff10d6cb944": [], "e1956f7ca582a22dd1f17b9e26fcb8229051b0ce6d33b47227824772feec": [], "e2": [], "e3": [], "e4": [], "e5": [], "e7": [], "e8": [], "e8c04e80e82391a6e51f218ca49720f64236bc824e92152a2633b74cf7ab": [], "e9": [], "e9fcff7623954d86bdc17782036cbf715ecab1bec4847c008557affe1ca8": [], "e_": 8, "ea": [], "each": [6, 8, 9, 11, 12, 17, 19, 21, 23, 25, 26, 27, 28], "ead346e904390a53e71b5da2df7e7839abb16e967ba07fa15addf1f9f37c": [], "earli": [8, 13, 19, 28], "earlier": [8, 19, 26], "earliest": 8, "easi": 8, "easier": [7, 17, 19], "easili": 17, "easy_gener": 10, "eb": [], "ebnj33fcrl": 0, "ec": [], "ecal": [5, 28], "eck": [0, 17], "econom": 20, "economi": 20, "ect": [0, 11, 13], "ed": [], "edg": 18, "edgar": [], "edict": [], "ediff": [], 
"edit": [0, 2, 5], "editor": [0, 5, 8, 9], "edmsound": 0, "educ": [7, 15], "edward": [], "edwin": [], "ee": [], "ee39c6e92acc742c052f137b47c210cd0a1b72dcd3f98495528bb4d27761": [], "eerili": 11, "eess": [0, 6, 8], "effect": [0, 5, 12, 16, 17, 18, 19, 21, 22, 23, 25, 26, 27, 28], "effici": [0, 8, 9, 11, 13, 15, 28], "effort": [20, 25], "efro": [], "egregi": 20, "ehgr20": [0, 13], "ehohc": [], "ehsan": [], "eikan": [], "einop": 10, "einops_ext": [], "einsum": 24, "either": [6, 8, 9, 10, 11, 19, 28], "elabor": 12, "elbmg07": [0, 17], "electr": 24, "electrifi": [4, 24], "electron": [0, 5, 16, 28], "element": [9, 10, 24, 28], "elena": [0, 8, 9], "eleph": 2, "eleventh": 0, "eli": [], "elia": [], "elio": [0, 6, 8, 9, 15, 18, 28], "elizald": [0, 8], "ell": 11, "elli": [0, 18, 28], "ellison": [], "eloi": [], "els": [4, 10, 16, 24], "elsen": [], "elucid": [], "ema": [], "ema_pytorch": [], "emanuel": 0, "emb": [4, 11], "embed": [0, 2, 3, 4, 6, 8, 11, 12, 14, 16, 18, 19, 21, 23, 24, 25, 26], "embedding_cat": 4, "embedding_prefix": 4, "embedding_text": 4, "embeddings_2d": 16, "emed": 28, "emerg": [8, 9, 13, 15, 19, 25], "emili": [], "emilian": 0, "emir": [0, 8, 9], "emmanouil": [0, 5, 6, 8, 9, 15, 18, 23, 28], "emmanouilid": [], "emnlp": [0, 8, 9], "emot": [0, 4, 9, 13, 20, 24], "emphas": [18, 25, 26], "emphasi": [15, 17], "empir": [0, 8, 9, 19], "emploi": [8, 14, 20, 21], "emr": [], "emu": [], "en": 10, "enabl": [4, 5, 7, 8, 13, 15, 16, 18, 19, 23, 24, 25, 27, 28], "enchant": 4, "encod": [3, 4, 11, 13, 14, 17, 18, 24, 28], "encodec": [11, 14, 19], "encompass": [9, 23], "encount": 28, "encourag": [11, 15, 19, 28], "end": [0, 5, 6, 11, 17, 21, 24], "endeavour": 6, "energet": [24, 28], "energi": 20, "enforc": 21, "engag": 25, "engel": [0, 5, 17, 18], "engin": [15, 17, 20], "english": [5, 19, 20, 21], "enhanc": [0, 3, 5, 9, 13, 18, 19, 28], "enjoi": 9, "enorm": [], "enough": [4, 19], "ensembl": [], "ensur": [4, 12, 26, 28], "enter": 17, "entir": [8, 9, 11, 12, 19, 23, 28], 
"entiti": 28, "entropi": [12, 14, 28], "enumer": 16, "env": 10, "envinro": 10, "environ": [10, 15], "eos_token_id": 4, "eot": 21, "ep": [], "epc": [0, 2, 11], "epoch": [4, 24], "epoch_loss": [4, 24], "epstein": [], "epur": [0, 8, 9], "equal": 4, "equat": [0, 11], "equilibrium": 0, "equit": 20, "equival": 19, "er": 20, "era": [9, 15], "eri75": [], "eric": [], "erich": [], "erickson": [], "erik": [0, 9], "ermon": 0, "err": [0, 13, 17], "error": 19, "escap": 5, "escriv": [], "esl": 0, "especi": [9, 11, 19, 20, 21], "essenti": [11, 12, 18, 19], "esser": [], "establish": [8, 9, 15, 26], "estim": [0, 27], "et": [4, 8, 9, 24, 25], "eta": [], "etc": [11, 16, 21, 27], "ethan": 0, "euclidean": 26, "eugen": [], "eunggu": [], "evad": [4, 24], "eval": [4, 24], "evalu": [0, 3, 5, 7, 8, 9, 13, 15, 18, 19, 23, 24, 28], "evan": 0, "even": [2, 4, 6, 9, 12, 19, 28], "event": [9, 23], "ever": 24, "everi": [8, 19, 21], "evgeni": [], "evolut": [3, 7, 9, 18, 23, 25], "evolv": [13, 17, 23, 28], "exact": [6, 8, 11], "exactli": [21, 24], "exam": 21, "examin": [3, 18, 28], "exampl": [2, 3, 4, 5, 7, 8, 9, 13, 14, 15, 16, 19, 21, 23, 25, 27, 28], "excel": [12, 16], "except": 21, "excit": [2, 3, 16, 24], "exclus": 23, "execut": 19, "exemplifi": [13, 20], "exercis": [2, 18], "exhibit": 20, "exist": [2, 5, 7, 8, 11, 16, 17, 19, 28], "exit": [], "exp": [8, 24, 28], "expand": [4, 5, 13, 27, 28], "expect": 9, "experi": [0, 4, 9, 23, 24, 25, 28], "experiment": [0, 28], "expert": [], "expertis": 15, "explain": [7, 15, 27], "explicit": 23, "explicitli": 10, "exploit": [], "explor": [0, 3, 9, 13, 15, 18, 23, 24, 25, 28], "export": 10, "expos": 28, "express": [0, 2, 9, 13, 16, 23, 25, 28], "expressivenss": 13, "ext": [], "extend": [0, 8, 24], "extens": [19, 21], "extern": [0, 19], "extra": 11, "extract": [2, 4, 11, 24, 28], "extractor": 8, "extrem": 21, "f": [4, 5, 6, 9, 10, 11, 24, 28], "f0": [], "f0b9ad6c0a9017e62d4735daaeb11ba3b6c009d69a26141b258cd37b5588": [], 
"f185bfd0ca1d213beb4293bed51d92254df23d8ceaf6c0e17146d508a776": [], "f2": [], "f2b75d2fc6f1a260f340f0e7c6a060f4dd2961cc16884ed851b0d18da06a": [], "f4": [], "f5": [], "f6": [], "f6bd1eee09314e7e6dee49cbe2c5e22314ccdb38db16c9fc72d2fa80d054": [], "f7": [], "f7e21b113dd48a9c97d364e0915b3988c6a0b6207652f5a92372871b7aa4": [], "f9": [], "f9d7fe80a8fcce9bb128d1381c6fe41a8d286d7e18395e273002e8e0fa34": [], "f_": [8, 11], "fa": [], "fabien": [0, 28], "face": [7, 13, 24, 25, 27], "facilit": 15, "fact": [8, 11], "factor": [0, 26], "fadernet": [], "fail": [2, 19, 23], "failur": 2, "fair": [20, 26], "fall": [8, 27], "fals": [4, 5, 10, 24, 26], "familiar": [4, 24], "fan": 4, "fandong": [], "fang": [], "fantast": 11, "far": [2, 4], "farid": [], "fashion": [9, 23], "fast": [0, 5], "fastapi": [], "fastcor": [], "faster": 19, "fatigu": 12, "favor": 2, "favour": 8, "fazeka": [0, 6, 8, 9, 15, 18, 28], "fb": [], "fc": [], "fd": [], "feasibl": 4, "featur": [0, 2, 4, 5, 7, 8, 9, 10, 11, 14, 17, 19, 21, 23, 24, 27, 28], "fed": 21, "federico": [], "fedu": [0, 18], "feed": 19, "feedback": [0, 3, 12, 18, 23, 25], "feel": 5, "felix": [0, 18], "femal": [4, 24, 28], "feng": [], "ferjad": [], "fernando": [0, 25], "few": [2, 4, 10, 11, 19, 26], "fewer": 26, "ff": [], "ff642e65ad6b90db43e668d70ffb6736436c7ce41fcc549f4e9472234127": [], "ffbf7a134b9ab11a67b0cf0726453cedd9c5043a4fe7a35d1cefa9a1bcfb": [], "ffmpy": [], "fid": [], "fidel": [0, 20], "fidler": [], "field": [3, 13, 15, 19, 20, 22, 28], "figsiz": 16, "figur": [13, 14, 16, 21], "file": [10, 24], "filelock": [], "filip": [0, 18, 25], "filippo": [], "fill": [19, 21, 22], "film": [7, 11, 21], "filter": [17, 27, 28], "filterwarn": [10, 16], "final": [4, 6, 7, 9, 11, 12, 13, 15, 19, 27], "find": [0, 4, 5, 7, 8, 9, 17, 19, 20, 23, 24, 26, 27], "fine": [0, 2, 5, 9, 11, 19, 28], "finetun": [0, 8, 11, 18], "finit": 21, "finnicki": 10, "fire": [], "firmli": 15, "first": [4, 6, 8, 9, 10, 11, 12, 13, 15, 17, 19, 21, 23, 24, 26, 27, 28], "fisch": [], 
"fischer": [], "fit": [2, 8, 11], "fit_transform": 16, "fix": [8, 9, 11, 17, 19, 21, 28], "fjeld": [], "flamingo": 19, "flash": [], "flashattent": [], "flat": 12, "flatten": 24, "flatten_dict": [], "flavio": [], "fleet": [], "flexibl": [3, 9, 15, 16, 19, 21, 22, 27, 28], "flexibli": 19, "float": 11, "float32": 10, "flore": 0, "florencia": [0, 18], "flori": 0, "florian": [], "flow": 0, "fltz10": [0, 17], "fm": 24, "fm22": [0, 11], "fma": [], "fn": 26, "focu": [2, 9, 11, 13, 15, 17, 23, 25, 28], "focus": [3, 6, 9, 13, 15, 16, 17, 18, 21, 26, 28], "folk": 24, "follow": [0, 2, 3, 4, 7, 8, 10, 11, 12, 13, 14, 17, 18, 19, 21, 26, 28], "fontsiz": 16, "fonttool": [], "foot": 22, "forc": 23, "foreign": [4, 24], "forget": 21, "forgo": 8, "fork": 10, "form": [2, 5, 6, 7, 8, 9, 11, 16, 19, 27], "formal": [2, 11, 23], "format": [4, 19, 22, 28], "formul": [7, 18, 23, 28], "forsgren": 0, "forth": [23, 24], "forum": [0, 8], "forward": [4, 11, 13, 14, 24], "fossez": [0, 18], "foster": 15, "found": [8, 19], "foundat": [0, 4, 8, 9, 15, 16, 18, 24, 28], "four": 26, "fourier": 0, "fp": 26, "fr": 0, "frac": [8, 26, 28], "fragkiadaki": [], "frame": 4, "framework": [3, 4, 8, 15, 18, 19, 22, 27, 28], "fran": 0, "franci": [], "francisco": 15, "francoi": 0, "frank": 0, "frechet": [], "freder": [], "fredo": [], "free": [0, 2, 4, 10, 24], "freedman": [], "freedom": [], "freeman": 0, "freeu": [], "freez": [4, 24], "freeze_backbone_model": 4, "freeze_parma": [4, 24], "french": 19, "fresh": 23, "fri": [], "from": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 23, 24, 25, 26, 27, 28], "from_pretrain": [4, 24], "frontier": 19, "frozen": [8, 19], "frozenlist": [], "fsspec": [], "ftfy": [], "fu": [0, 17], "full": [2, 9, 13, 19, 23, 24], "fulli": [2, 5, 8, 10, 11, 20, 23], "function": [2, 4, 5, 8, 11, 13, 14, 24, 26], "functool": 10, "fundament": [17, 25, 26, 28], "furkan": 0, "further": [8, 9, 11, 13, 15, 18, 19, 22, 28], "furthermor": [16, 18, 28], "fuse": [], "fusion": [0, 
28], "futga": [0, 5, 9], "futur": [3, 15, 17, 20, 21, 23, 28], "futurewarn": [4, 10, 24], "g": [0, 2, 4, 5, 8, 9, 10, 11, 19, 22, 25, 26, 27, 28], "g_": 11, "ga": 0, "gaa": [], "gabbolini": [0, 8, 9], "gabriel": [0, 18, 28], "gain": 28, "gal": [], "game": [], "gamma_": 11, "gamper": [], "gan": [0, 11, 19, 21], "ganguli": [], "ganti": [0, 18, 25, 28], "gao": [0, 9], "gap": [3, 23, 27, 28], "garcia": 0, "gardner": [0, 8, 9, 18], "gareth": 0, "gat": [0, 18], "gate": 11, "gaussian": 11, "gayoung": [], "gdsb23": [0, 8, 9, 16, 18], "ge": [0, 5, 8, 9], "geeta": [], "gef": [], "gemmek": [], "gen": [], "gener": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 20, 21, 22, 23, 25, 28], "generalis": 9, "generate_diffusion_cond": 10, "generated_text": 4, "genr": [0, 2, 5, 9, 13, 16, 17, 23, 27, 28], "geoffroi": [0, 9], "geometr": [6, 28], "geon": [], "georg": [0, 6, 8, 15], "geq": [], "gerhard": [], "germain": 0, "german": 19, "gert": [0, 17, 18, 27], "gestin": [], "get": [6, 10, 11, 19], "get_device_nam": [4, 24], "get_ipython": 4, "get_item_vector_db": 24, "get_pretrained_model": 10, "get_query_embed": 24, "ggbe24": [], "ggdre22": [0, 13], "ggre21": [], "ghanem": [], "gharbi": [], "ghasemipour": [], "ghe22": [0, 8, 9], "gherardi": [], "ghosal": [0, 5], "gil": [], "gimelshein": [], "gin": [], "gin_config": [], "ginneken": [], "ginsburg": [], "giorgi": 0, "giorgio": 0, "giovanni": [0, 8, 9], "girish": [0, 28], "girl": 24, "git": 15, "git18": [], "gitdb": [], "github": [4, 10, 15, 24], "gitpython": [], "give": [5, 8, 9, 11, 19], "given": [2, 8, 9, 11, 12, 13, 17, 19, 21, 22, 26, 27], "gl83": [0, 12], "glasner": [], "global": [9, 11, 28], "glove": 28, "gltq23": [0, 9], "gmmp23": [], "go": [2, 9, 11, 13, 18, 19, 22, 24], "goal": [3, 7, 9, 11, 19, 28], "goe": 19, "goel": 0, "goh": [0, 28], "gokul": [], "golai": [], "gold": 6, "goldberg": [0, 8, 9], "golden": [], "gome": [], "gomez": [0, 5, 8, 9], "gone": 7, "gong": [], "gongfan": [], "gontijo": [], "good": [2, 11, 12, 19], 
"goodfellow": 0, "googl": [4, 13, 15, 24], "gordon": 0, "got": [4, 10, 24], "gouyon": [0, 28], "goyal": [], "gpt": [0, 4, 5, 8, 15, 18, 19, 22, 28], "gpt2": 4, "gpt2lmheadmodel": 4, "gpt2token": 4, "gpu": [4, 24], "grachten": 0, "gradient": [0, 2, 11, 22], "gradio": [], "gradio_cli": [], "gradual": 11, "grai": 21, "grain": [0, 2, 5, 9, 28], "gram": [6, 22], "grandios": 4, "granular": 27, "graph": [6, 19, 28], "grave": [0, 17], "great": [2, 4, 11, 20], "greater": [13, 28], "greatest": 16, "green": [0, 13, 16, 17, 21], "greenwood": 0, "greg": [], "gregori": [], "grew": 17, "griffin": 0, "gritsenko": [], "groh": [], "grosch": [0, 28], "gross": [], "ground": [4, 6, 24, 26], "groundwork": 17, "grounth": 6, "group": 0, "grow": [2, 25, 27], "grown": 13, "grpcio": [], "grug17": [], "gschwind": [], "gskp23": [0, 2, 13], "gt": 4, "gu": 0, "guanglu": [], "guangzhi": [0, 8], "guestrin": [], "gui": [], "guid": [0, 2, 5, 8, 13, 23, 25], "guidanc": [0, 2, 10, 15], "guitar": [2, 5, 10, 16, 24], "gulrajani": [0, 17], "gunjan": [], "guo": [0, 5, 8, 9], "guojun": [0, 17], "gupta": [], "gupta2023photorealisticvg": [], "guu": [0, 18], "gy": [0, 8, 9, 18, 28], "gy\u00f6rgi": [0, 9], "gz": [], "h": [0, 8, 11], "h11": [], "h5py": [], "h_audio": 24, "h_text": 24, "ha": [0, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19, 23, 27, 28], "had": [11, 28], "hadjer": 0, "hai": [0, 6], "hall": 0, "hallaci": [0, 28], "hallucin": [19, 20], "han": 0, "hand": [7, 9, 19], "handl": [11, 16, 17, 19, 23, 25], "hang": [], "hani": [], "hann": [], "hantrakul": 0, "hao": [0, 5, 9], "haoh": [0, 18], "haoran": [], "haoxin": [], "haoyi": [0, 28], "happen": [4, 19], "happi": [5, 16, 27, 28], "hard": [2, 4, 7, 11, 19, 28], "harder": 9, "harm": [], "harmon": [2, 6], "hat": [8, 11], "have": [2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19, 20, 21, 22, 23, 24, 27, 28], "hawlei": 0, "hawthorn": 0, "hayk": 0, "hce": [], "he": [0, 9, 15], "head": [19, 28], "hear": [0, 7, 8], "heard": 5, "heart": 24, "heavi": [5, 28], 
"heewoo": [0, 18], "heiga": [0, 17], "height": 11, "helen": [], "helena": [0, 5, 8, 9], "hellsten": [], "help": [0, 3, 6, 12, 17, 19, 26, 28], "hennequin": [0, 8, 9], "her": [4, 15, 24], "herald": 15, "herbert": [], "herd": [], "here": [2, 7, 10, 11, 21, 24], "herman": [], "herreman": [0, 5], "herrera": [0, 9], "hershei": [], "hertz": [], "hertzmann": [], "hesit": 3, "heusel": 0, "hf_token": 10, "hh": [], "hhl": [0, 9], "hhy": [], "hi": 15, "hi79": [0, 13], "hidden": [11, 21], "hierarch": 0, "high": [0, 2, 8, 9, 11, 12, 16, 17, 19, 27, 28], "higher": [6, 7, 11, 13, 26], "highest": 26, "highli": 12, "highlight": [2, 4, 7, 10, 15, 23, 25], "hila": 0, "hiller": 0, "hilton": [], "hing": 28, "hint": 19, "hiromi": [0, 6], "hirsh": [], "histor": [21, 23], "histori": [9, 22, 23, 25], "hit": 11, "hja20": [0, 13], "hjc": [], "hjl": [0, 16, 18, 28], "hla": [], "hlss23": [0, 5, 8], "hmt": [], "ho": 0, "hochreit": 0, "hoffman": 0, "holger": [0, 28], "holist": [0, 19], "holoview": [], "holynski": [], "hongsheng": [], "hook": 24, "hope": 3, "horac": [], "hot": [16, 17, 18, 19], "hotel": 28, "hou": [0, 18], "how": [0, 2, 3, 4, 6, 7, 8, 9, 11, 15, 17, 18, 19, 21, 22, 23, 25, 26, 28], "howev": [8, 9, 11, 12, 15, 17, 19, 21, 23, 25, 27, 28], "hpn17": [0, 13], "hpw": [], "hru": [0, 12], "hs21": [], "hsg": [], "hsiang": [0, 6], "hsiao": 0, "hsin": 0, "hsr": [], "hsuan": [0, 17, 18], "ht": 28, "html": 5, "http": [0, 4, 5, 6, 8, 9, 10, 15, 24], "httpcore": [], "httpx": [], "hu": [], "huam": [0, 8], "huang": [0, 5, 8, 9, 18, 28], "hub": [], "hubert": 0, "hug": 24, "huge": [16, 19, 20, 21, 22], "huggingfac": [4, 10, 24], "huggingface_hub": 10, "hugo": 0, "hui": [0, 28], "huiwen": 0, "hum": 5, "human": [0, 6, 7, 9, 13, 15, 17, 18, 20, 23, 25, 27, 28], "humphrei": [], "hundr": 19, "hussain": [0, 5, 8], "hvu": [0, 13], "hy20": [0, 13], "hybrid": 8, "hyelin": [], "hyper": 2, "hyperparamet": 28, "hyung": [0, 18], "hyungjin": [], "hzrs16": [], "i": [0, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], "ian": 0, "icassp": [0, 5, 8, 9, 15, 18, 28], "icassp48485": [0, 5, 8], "iccv": 0, "iclr": [0, 17], "icml": [0, 15, 18], "id": [0, 5, 8], "idea": [11, 16, 19, 28], "ideal": 19, "ident": 11, "identifi": [12, 19, 26, 28], "idf": 6, "idna": [], "ieee": [0, 5, 8, 9, 15, 17, 18, 27, 28], "ieeexplor": [0, 9], "iffus": [], "ignor": [10, 16, 23], "ignore_index": 4, "ijcai": [0, 8], "ijcnn": [0, 8, 9, 18], "ijcnn54540": [0, 9], "ikemiya": 0, "il": [], "ilaria": [0, 3, 5, 6, 8, 9, 15, 18, 23, 28], "ilg": [0, 18], "illia": 0, "illustr": [13, 14, 18, 19], "ilya": [0, 18], "imag": [0, 6, 8, 11, 12, 16, 17, 19, 20], "imagegpt": 19, "imageio": [], "imagin": [2, 11, 24], "imbu": 11, "immers": [], "impact": [8, 16, 20], "imperi": 15, "implement": [8, 20, 22, 24, 28], "implicit": 23, "implicitli": 11, "import": [3, 4, 5, 10, 11, 16, 19, 23, 24, 25, 26, 28], "importlib": [], "importlib_resourc": [], "impract": 27, "impress": [13, 24], "improv": [0, 4, 13, 19, 23, 24, 25, 26, 28], "inabl": [2, 23, 25], "inaccur": 19, "inbar": [], "inc": 0, "includ": [4, 6, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 22, 26], "inclus": [12, 15], "incorpor": [4, 9, 13, 14, 18, 19, 23, 25, 28], "incorrect": 19, "incorrectli": 26, "increas": [19, 20, 27, 28], "increasingli": [3, 7], "incredibli": 24, "independ": [23, 25], "index": [4, 11, 24], "indi": 5, "indian": [4, 24], "indic": [12, 19, 21, 24, 26, 27], "individu": 28, "indulg": 15, "infer": [0, 2, 10, 16, 18, 19, 23, 27], "infinit": [], "influenc": [2, 8, 18, 24, 28], "influenti": [21, 22], "info": [], "infonc": 28, "inform": [0, 2, 4, 6, 8, 9, 13, 15, 16, 17, 18, 19, 21, 22, 23, 25, 26, 27], "informat": [4, 24], "infus": [4, 24], "inher": 23, "init": 2, "init_temperatur": 24, "initi": [4, 8, 11, 13, 17, 23, 27], "initialis": 8, "inlin": [], "innov": [13, 15, 20, 28], "inpaint": 2, "input": [2, 4, 8, 9, 11, 12, 13, 14, 17, 21, 22, 28], "input_id": [4, 24], "inputs_emb": 4, "insid": [11, 21], "insight": 12, 
"inspir": [4, 13], "instal": [4, 10, 15], "instanc": [21, 28], "instead": [5, 6, 7, 8, 9, 11, 19, 21, 23], "institut": [0, 17, 20, 27], "instruct": [0, 5, 6, 9, 18], "instruct_2023": [], "instructgpt": 19, "instructpix2pix": [], "instrument": [0, 2, 5, 9, 13, 16, 17, 23, 27, 28], "int": [4, 24], "int16": 10, "integ": 21, "integr": [11, 12, 13, 18, 19, 25], "intellig": [0, 8, 15, 19, 28], "intend": [13, 22], "intens": 2, "intent": [0, 16, 19, 23, 25], "interact": [9, 13, 15, 19, 20, 23, 25, 28], "interest": [2, 7, 15, 17, 22, 23, 24, 25], "interestingli": 19, "interfac": [8, 15, 25], "interleav": [11, 19], "intermedi": [0, 8], "intern": [0, 5, 6, 8, 9, 11, 15, 17, 18, 19, 20, 25, 28], "internationaltunion01": [], "internet": [20, 28], "interpret": 19, "interspeech": 0, "interv": [], "intervent": 0, "intric": 4, "intro": 11, "introduc": [3, 11, 13, 14, 18, 19, 21, 25, 28], "introduct": 18, "intuit": [2, 16, 19, 25, 27], "invalu": 12, "invers": [0, 2], "invert": [], "invertib": 2, "investig": [3, 18], "invok": 19, "involv": [10, 11, 17, 19, 26], "io": [], "ipd": 10, "ipython": [4, 5, 10, 24], "iqbal": 0, "irani": [], "iren": 0, "irrelev": [26, 28], "is_avail": [4, 10, 24], "isaacson": 0, "isca": 0, "ish": 4, "ishaan": [0, 17], "ismir": [0, 8, 9, 11, 15, 17, 18, 28], "ismir2008": 17, "ismir2019": 17, "ismir2021": 17, "isn": [5, 6, 19, 23], "isola": [], "isotrop": 11, "issn": [0, 9], "issu": [4, 6, 10, 23, 24], "itai": [0, 18], "item": [0, 4, 5, 8, 9, 16, 18, 24, 25, 26], "item_joint_embed": 24, "item_vector_db": 24, "iter": [2, 8, 23], "its": [2, 4, 7, 8, 9, 12, 14, 15, 16, 17, 19, 20, 22, 23, 27, 28], "itself": [2, 5, 9, 11, 19], "itu": 0, "izze17": [], "j": [0, 18, 28], "jaakko": [], "jaakkola": [], "jack": [0, 28], "jacob": [0, 18, 28], "jacquelin": [0, 9], "jade": [0, 18], "jae": 0, "jaegl": [], "jaejun": [], "jaesik": [], "jaewoong": [], "jai": [], "jain": [0, 17], "jakob": 0, "jamendo": 5, "jampani": [], "jan": [0, 5], "janko": [0, 18], "jann": 0, "janner": [], 
"jansen": [0, 5, 18, 28], "jargon": 19, "jasa": [], "jascha": 0, "jasco": 2, "jason": [0, 18], "jauhri": [], "javier": 0, "jayasumana": [], "jazz": [27, 28], "je": [], "jedi": [], "jeffrei": [0, 9, 18], "jen": [], "jeong": [0, 18, 28], "jeongsol": [], "jess": [0, 5, 17, 18], "jessica": [], "ji": [], "jiacheng": [], "jiaheng": [0, 5, 9], "jiahui": 0, "jiaji": [], "jiajia": [0, 6], "jiam": [], "jian": [], "jianbin": [], "jianfei": [], "jiang": [0, 18], "jianglong": [], "jianlin": [], "jianmin": [], "jianxin": [0, 28], "jiawei": [], "jiayi": [], "jie": [], "jimmi": [], "jin": [0, 23], "jinan": [], "jinbo": [], "jing": [], "jinglin": [], "jingren": [], "jingwei": 0, "jinja2": [], "jiong": [], "jitong": 0, "jiwen": [], "jiyoung": [0, 18, 28], "jmespath": [], "jnmr": [0, 9], "joao": [], "joar": [], "job": 20, "joblib": [], "john": 0, "join": [4, 8, 24], "joint": [0, 2, 3, 8, 9, 18, 23, 25], "joint_dim": 24, "jointembeddingmodel": 24, "jointli": 28, "jona": [], "jonah": 0, "jonathan": 0, "jone": 0, "jong": [0, 15, 18, 28], "jongmin": [], "jongpil": [0, 8, 9, 17, 18, 28], "jongwook": 3, "joon": [], "joonseok": [0, 18, 28], "jooyoung": [], "jordi": 0, "jort": [], "jose": [0, 17], "josef": [0, 5], "joseph": [], "josh": [0, 8, 9, 18], "joshua": [], "josiah": 0, "journal": [0, 9, 17, 18, 23], "journei": 17, "joy": 5, "jrv": [], "jsonmerg": [], "jsonschema": [], "ju": 0, "juan": 0, "judg": 6, "judgement": 6, "judith": [0, 18, 28], "juhan": [0, 8, 9, 15, 17, 18, 23, 25, 28], "juho": [], "jukebox": [0, 13, 15, 18], "jukedrumm": [], "julian": [0, 5, 9, 15, 17, 18], "julio": [], "juliu": [], "jump": 11, "jun": [0, 18], "junbo": 0, "junda": [0, 5, 9], "june": [0, 5, 8, 9], "junghyun": 0, "junho": [], "junyan": 0, "jupyt": 15, "just": [10, 11, 12, 19, 20, 21, 23, 24, 25, 27], "justin": [0, 5, 28], "k": [8, 9, 14, 24, 26], "k_diffus": [], "kaal22": [], "kadian": [], "kai": [0, 17], "kaim": [], "kaiser": 0, "kaist": 15, "kakao": 15, "kal": [], "kalambarkar": [], "kalchbrenn": [0, 17], 
"kamko": [], "kamyar": [], "kang": [], "kant": [0, 18], "kao": [], "karagol": [], "karan": 0, "karen": [0, 17], "karra": [], "karsten": [], "karunratanakul": [], "kastner": [], "katarina": [0, 18], "kate": [0, 8], "katerina": 0, "katharopoulo": 0, "katherin": [0, 18], "kavukcuoglu": [0, 17], "kawar": [], "kazuhito": [], "kb": [], "kb14": [], "kbockw15": [], "kci": [0, 12], "ke": [0, 3, 5, 8, 15, 18, 19, 23, 28], "keep": 24, "keepdim": 16, "kei": [3, 7, 8, 9, 10, 12, 13, 14, 17, 18, 23, 26, 27, 28], "keji": [], "kelvin": [0, 18], "kenton": [0, 18, 28], "kept": 8, "keqiang": [], "keren": [], "keunwoo": [0, 8, 9, 17, 18, 23, 25, 28], "kevin": [0, 5, 8, 9], "kexin": [], "keyword": [0, 13, 28], "kfir": [], "kharitonov": [], "khurana": 0, "khz": 11, "ki": [], "kilgour": 0, "kilian": [], "kim": [0, 9, 15, 18, 23, 25, 28], "kind": [5, 9, 19, 21], "kingma": 0, "kirchhoff": [0, 28], "kirel": [], "kirkpatrick": [0, 8, 15, 18, 28], "kirsch": [], "kiwisolv": [], "kjz24": [], "kkdb": [], "kkkm23": [], "kl": [11, 12], "klaski": [], "kll": [0, 11], "knife": [0, 6], "knob": 10, "knolwedg": 4, "know": [9, 21], "knowledg": [0, 4, 8, 9, 16, 19, 20, 21, 24, 28], "known": [10, 27, 28], "koh": [], "kohler": [], "koichi": [], "koishida": [], "kong": 0, "konpat": [], "koo": 0, "korai": [0, 17], "kornia": [], "kornia_r": [], "korraw": [], "korzeniowski": [], "kosta": 0, "kostrikov": [], "kozareva": [0, 8, 9], "kpa": [], "kph": [], "kpschonfeld": [], "krasheninnikov": [], "kreb": [], "krei": [], "kreuk": [0, 18], "kristina": [0, 9, 18, 28], "krisztian": [0, 18, 25], "krueger": [], "kshiteej": [], "ksl": [0, 11], "ksm": [0, 9], "ksp": [0, 13], "ku": [], "kullback": 12, "kumar": [0, 17], "kundan": [0, 17], "kuznetsov": 0, "kw13": [], "kwak": [], "kwan": [], "kwg": [0, 2], "kwon": [0, 23, 25], "kyle": [], "kynk": [], "kynkaanniemiak": [], "kyunghyun": [0, 9], "kzb": [], "kzl": [], "kzrs18": [], "kzrs19": [0, 12], "kzz": [], "l": [6, 8, 9, 28], "l1": 14, "l177": 10, "l2": 14, "lab": 15, "label": 
[0, 4, 7, 9, 12, 17, 19, 21, 24, 26, 27], "lack": [2, 11, 19, 23, 28], "lai": 0, "laid": 17, "lain": [], "laion": [], "laion_clap": [], "lala": [], "lam": [], "lam08": [0, 17], "lama": [0, 18], "lamer": [0, 17], "lamtharn": 0, "lanckriet": [0, 17, 18, 27], "land": [], "lang": 0, "langaug": [24, 28], "languag": [0, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 16, 17, 18, 20, 23, 24, 25, 27, 28], "lanzendarferppw": [], "lanzendarferppw24": [], "lanzendorf": [0, 8], "lanzendorfer_blap_nod": [], "lanzend\u00e3": [], "larg": [0, 4, 5, 6, 7, 8, 11, 18, 19, 20, 23, 24, 25, 26, 27, 28], "larger": [4, 19, 24], "larson": [0, 8], "last": [4, 7, 10, 16, 19, 24], "last_hidden_st": 24, "lastli": [20, 21], "late": [0, 8], "latenc": [19, 20], "latent": [0, 2, 8, 11, 13, 14, 24, 28], "later": [8, 11, 13, 19, 22], "latest": [7, 15, 22], "latin": 20, "latter": 9, "lattner": 0, "lau": [], "launch": 15, "laura": [], "laurent": [], "laurier": [0, 17], "lav": [], "lawrenc": [], "layer": [0, 8, 11, 19, 21, 27, 28], "lazi": [], "lazo": [], "lazy_load": [], "lcy": [0, 11, 13], "ldm": 0, "ldot": [8, 9, 26], "le": [0, 18], "leach": [], "lead": [6, 12, 13, 19, 23, 25, 27, 28], "learn": [0, 2, 3, 4, 7, 8, 9, 11, 12, 13, 15, 17, 18, 21, 22, 23, 25, 26, 27], "learnabl": [8, 28], "learner": [0, 4, 16, 18], "learnt": 4, "least": [6, 11, 26, 28], "leav": 23, "led": 13, "lee": [0, 8, 9, 17, 18, 23, 28], "lee10": [0, 23], "leemput": [], "left": [8, 11, 19], "legend": 16, "legg": 0, "lehtinen": [], "lei": [], "leibler": 12, "lejaren": 0, "lejun": 0, "len": [4, 11, 16, 24], "length": [5, 6, 9, 10, 11, 19, 20], "lenient": 26, "leo": 0, "leonard": 0, "leoni": [0, 18], "lerman": [0, 9], "less": [12, 19, 20, 26], "lester": [0, 18], "leszczynski": [0, 18, 25], "let": [2, 6, 8, 9, 16, 19, 20, 21, 28], "letman": [], "letter": [0, 9, 19], "level": [0, 2, 6, 7, 8, 9, 11, 12, 16, 22, 27, 28], "leverag": [2, 3, 4, 8, 12, 13, 14, 16, 19, 25], "levi": 0, "levin": [], "lexic": 6, "lezcano": [], "lgw": [0, 2], "lhss24": [0, 5, 
8], "li": [0, 6, 8, 9, 12, 18, 28], "liang": 0, "liao": [0, 6], "lib": [4, 10, 24], "librari": [], "librosa": 10, "licens": [4, 5], "lick": 5, "lifeng": [], "light": 8, "lightn": [], "lightning_util": [], "like": [2, 5, 6, 8, 9, 10, 11, 13, 15, 16, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28], "likelihood": 17, "lim": 0, "limit": [0, 3, 9, 13, 16, 17, 18, 20, 21, 22, 27, 28], "lin": [0, 9], "lina": [], "linalg": 16, "line": [4, 8, 10, 16, 19, 21, 24, 28], "line2d": 16, "linear": [4, 6, 8, 11, 24], "lingelbach": [], "linguist": [0, 5, 8, 9], "link": 4, "linkifi": [], "linmiao": [], "linux": 10, "lior": [0, 17], "lipman": [], "list": [4, 11, 19, 24, 26], "listen": [0, 4, 5, 10, 17, 24], "literatur": [6, 9, 19], "liu": [0, 5, 8, 9, 18, 24, 28], "liu19": [0, 28], "liu_music_2024": [], "live": [5, 21], "liwei": [], "lka": [], "lkopf": [], "ll": [2, 4, 8, 10, 11, 21, 22], "llama": [0, 5, 8, 19], "llark": [0, 8, 9, 18], "ller": [], "llion": 0, "llm": [0, 3, 5, 6, 18, 19, 20, 28], "llvmlite": [], "lm": [2, 11, 13, 18], "lmn23": [], "lmnt": [], "lmz": [], "ln17": [0, 9], "load": [10, 24], "load_dataset": [4, 5, 24], "loader": [], "local": [0, 4, 10, 24], "local_attent": [], "localis": 9, "locat": 21, "lockhart": [], "log": [11, 19, 24, 28], "logit": [4, 17, 19, 24, 27], "logit_scal": 24, "loi": 0, "london": 15, "long": [0, 9, 11, 12, 17, 18, 19, 21, 22, 24, 27], "longbo": [], "longer": [6, 9, 11, 19], "longest": [4, 6, 24], "longpr": [0, 18], "look": [7, 9, 11, 18, 19, 21, 23, 26, 27], "looper": 2, "lope": [], "lorenz": [], "loss": [3, 4, 14, 19, 23, 24, 25], "loss_a2t": 24, "loss_t2a": 24, "lost": [], "lot": [11, 16, 19, 21], "loui": [], "low": [2, 12, 16, 27], "lower": [12, 19, 26], "lowest": 19, "lp": [0, 4, 5, 8, 18, 24], "lpg": [], "lppw24": [0, 8], "lr": [4, 24], "lsp": [0, 9], "lstm": 22, "ltgm19": [], "lth": [], "ltl": [], "ltu": 8, "lu": [0, 6, 8, 9, 17], "luan": [0, 18], "luca": [0, 8], "luckili": 2, "lueb": 0, "luk": [], "lukasz": 0, "luke": [0, 17, 18, 27], "lun": 
[0, 18], "lunch": [], "luo": [], "luong": [], "lupe": [], "lvmin": [], "lxjz23": [], "ly": [], "lyl": [0, 11], "lyric": [9, 16, 23], "lyt": [0, 6], "lzb": [], "lzg": [0, 25], "m": [0, 8, 9, 15, 18], "m1": 10, "m2ugen": [0, 5, 8], "ma": [0, 5, 8, 9], "maarten": [0, 18], "mac": 10, "mach": 0, "machan": 8, "machin": [0, 6, 7, 8, 13, 15, 17, 18, 19, 21, 28], "maciej": [], "macosx_10_10_x86_64": [], "macosx_10_12_x86_64": [], "macosx_10_13_x86_64": [], "macosx_10_15_universal2": [], "macosx_10_15_x86_64": [], "macosx_10_5_x86_64": [], "macosx_10_6_intel": [], "macosx_10_9_intel": [], "macosx_10_9_universal2": [], "macosx_10_9_x86_64": [], "macosx_11_0_arm64": [], "macosx_14_0_x86_64": [], "madmom": [], "maestro": [0, 6], "magazin": [0, 17, 18], "magenta": 13, "magnatagatun": [4, 5, 24, 28], "magnatagtun": [], "magnitud": [], "maher": [], "maheswaranathan": [], "mahieux": [0, 17], "mai": [2, 5, 9, 12, 13, 19, 21, 23, 25], "main": [0, 2, 5, 7, 8, 9, 10, 11, 20, 21, 27], "maintain": [23, 25, 27], "major": [6, 20, 21, 23, 28], "majumd": [0, 5], "make": [2, 3, 4, 5, 8, 9, 10, 19, 20, 21, 23, 25, 27, 28], "male": 5, "malici": 20, "malinowski": [], "manag": [20, 27], "manco": [0, 5, 6, 8, 9, 15, 18, 23, 28], "mancusi": 0, "mandic": 0, "maneesh": [], "mani": [2, 5, 8, 9, 11, 16, 20, 21, 22, 23, 24, 28], "manifold": 16, "manilow": 0, "mannies": [], "manoj": [], "manor": 0, "manual": 11, "mao": [0, 6], "map": [2, 4, 8, 9, 11, 13, 19, 21, 26, 28], "marc": 0, "marcel": [], "marco": [0, 5, 18], "mard": 5, "margin": [11, 28], "mari": 15, "mariani": 0, "marianna": [0, 18], "marini": [], "mario": [], "mark": [0, 8, 9, 13, 21], "markdown": [], "markdown2": [], "marker": 16, "markerfacecolor": 16, "markers": 16, "markov": 19, "markupsaf": [], "mart": 0, "martin": 0, "martiro": 0, "marvin": [], "mask": [0, 2, 4, 13, 18, 22, 28], "maskgit": 0, "massachusett": [0, 17, 27], "massiv": [0, 6, 21], "masterpiec": 4, "match": [12, 19, 21, 26, 27, 28], "matena": [0, 18], "materi": 15, "mateusz": 
[], "math": [11, 19], "mathbb": 11, "mathbf": 11, "mathcal": [9, 11, 28], "mathemat": [19, 22], "mathew": [], "mathews1969technologi": [], "mathrm": 11, "mathur": [], "matplotlib": 16, "matrix": 12, "matt": 0, "matthew": 0, "matthia": [], "mauricio": 0, "mauro": [0, 5, 18], "max": [10, 24, 28], "max_length": [4, 24], "maxim": [19, 28], "mayb": 2, "mb": [], "mbl10": [], "mbqf21": [0, 8, 9, 18], "mbqf22": [18, 28], "mbqf22a": 0, "mbqf22b": [0, 16], "mc": 24, "mcaulei": [0, 5, 9, 15, 17, 18], "mccann": [], "mcfee": [], "mckee": [0, 5], "mcleavei": [], "mcy": [], "mdit": [], "mdurl": [], "me": [24, 28], "me14": [], "mean": [0, 2, 5, 6, 8, 9, 16, 17, 19, 24, 26, 27, 28], "meaning": [6, 27, 28], "measur": [4, 6, 12, 24, 25, 26, 28], "mechan": [2, 4, 8, 14, 21, 25, 27, 28], "media": [4, 11], "median": 26, "medic": [], "medium": 2, "meet": 26, "megan": [0, 18, 25], "mehri": [0, 17], "mehrish": [], "mei": 0, "meinard": [], "mel": [11, 14, 21], "melanchol": [24, 28], "melechovski": [0, 5], "melgan": [], "melod": 13, "melodi": [2, 4, 5, 13], "member": 15, "memcnn": [], "memo": [], "memori": [], "meng": [], "menghan": [], "mengji": [0, 6], "mengy": [], "menick": [], "mention": 16, "mert": [4, 28], "mesmer": [4, 24], "meta_db": 24, "metadata": [2, 16, 23, 24, 27, 28], "metal": 28, "meteor": 6, "meter": [], "method": [0, 2, 3, 8, 9, 12, 14, 16, 17, 18, 19, 21, 22, 28], "methodologi": [3, 15, 17, 18], "metric": [0, 3, 12, 18, 24, 26], "metzler": [], "mexico": [0, 5, 8, 9], "mfmw24": [], "mgg": [0, 2, 5], "mha": [], "mi": [], "miccai": [], "micha": 0, "michael": [0, 18], "michal": [], "micha\u00ebl": [], "michel": 0, "michigan": 15, "micro": [], "midi": [], "midinet": [0, 13], "might": [2, 4, 13, 19, 28], "migneco": [0, 9], "mihir": [], "miika": [], "mike": [], "mikhail": [], "mildenhal": [], "miller": [0, 17], "million": [5, 19], "mimic": 8, "min": [0, 24], "ming": [0, 17, 18, 28], "mingbo": [], "minghui": [], "mingni": [0, 6], "minguk": [], "mingz": [], "mini": [19, 28], 
"minim": [11, 19, 28], "minimum": 28, "minor": 10, "minz": [0, 4, 5, 9, 18, 23, 24, 28], "mir": [2, 3, 4, 9, 15, 16, 24], "mir_ev": [], "mishkin": [0, 18, 28], "misinform": 20, "mislead": 12, "mismatch": 2, "miss": [0, 3, 23, 25, 26], "mission": 20, "mit": [5, 17], "mitsubishi": 15, "mitsufuji": [0, 6], "mix": [6, 8], "mixtur": 8, "mixup": [0, 18], "mjxz23": [0, 13], "mkg": [0, 13, 17], "mlm": 21, "mlp": 11, "mlvalimaki23": [], "mlx": [], "mm24": [0, 2], "mmm": [], "mo": [], "modal": [0, 5, 8, 12, 13, 15, 19], "mode": [2, 19], "model": [0, 2, 3, 5, 6, 7, 9, 10, 12, 13, 14, 15, 16, 17, 20, 23, 25, 27], "model_config": 10, "moder": 11, "modern": [11, 21, 23, 24], "modifi": [0, 8, 23], "modul": [2, 4, 10, 11, 12, 14, 16, 24], "modulenotfounderror": [4, 10, 16, 24], "moham": [0, 17], "mohammad": [0, 17], "mojtaba": 0, "mokadi": [], "mold": 2, "molei": [], "molin": [], "monica": 0, "monitor": [0, 20], "mood": [2, 5, 9, 16, 17, 23, 27, 28], "moon": [], "moonseok": [], "moor": [], "mor": [0, 17], "more": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28], "morgan": [], "morph": 2, "morton": [0, 9], "mosseri": [], "most": [2, 4, 7, 8, 10, 11, 12, 16, 17, 19, 21, 22, 24, 28], "mostafa": [0, 18], "mostli": 23, "motion": [], "motiv": 11, "move": [4, 25], "mpmath": [], "mpt": 5, "mqa": [6, 8, 9], "mrr": 26, "msci": 15, "msd": 28, "msdm": 2, "msn24": [0, 28], "mssr23": [0, 5], "mtc": [], "mtg": 5, "mtp": [0, 2], "mtt": [4, 24], "mu": 5, "mucap": 5, "much": [6, 8, 9, 10, 11, 21, 23], "muchomus": [0, 6], "muedit": 5, "muhammad": [], "mul": 10, "mulab": [4, 15, 24], "mulan": [0, 16, 18, 28], "mulap": 16, "mullama": [8, 9], "muller15": [], "multi": [0, 5, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 23, 25], "multiclass": 27, "multidict": [], "multilabel": [27, 28], "multimedia": [0, 17, 28], "multimod": [0, 3, 6, 9, 15, 18, 20, 28], "multipart": [], "multipl": [0, 6, 8, 9, 11, 16, 18, 19, 20, 21, 23, 25, 26, 27, 28], "multitask": [0, 4, 
18], "multitrack": [0, 13], "murata": 0, "murtadha": [], "muscap": [0, 8, 9, 18], "muse": [], "musegan": [0, 13], "musemorphos": [], "music": [0, 3, 4, 5, 6, 12, 13, 14, 16, 19, 21, 22, 23, 24, 25, 26, 27, 28], "musicbench": 5, "musiccap": [0, 4, 5, 8, 18, 24], "musiccaptioningmodel": 4, "musicfm": [4, 24], "musicfm_emb": [4, 24], "musicgen": 13, "musicgenerationtempl": [], "musichifi": 0, "musician": [2, 5], "musicinstruct": 5, "musicldm": [0, 11, 13, 18], "musiclm": [0, 5, 13, 18], "musicmagu": [0, 2], "musicqa": 5, "musictextclip": 5, "musictextdataset": [4, 24], "musicva": 13, "musika": [], "musilingo": [0, 5, 8, 9], "must": [4, 23, 25, 28], "mustango": [0, 2, 5], "mvdream": [], "mwd": [0, 5, 23], "mwpt18": [], "mwpt19": [0, 17], "my": [], "n": [0, 4, 5, 6, 8, 10, 11, 18, 22, 24, 28], "n_compon": 16, "n_step": 10, "naacl": [0, 5, 8, 9], "nabla_": 11, "naeem": [], "nah": [], "naik": [], "nal": [0, 17], "nam": [0, 8, 9, 15, 17, 18, 23, 25, 28], "namburi": [0, 5, 9], "name": [4, 8, 10, 13, 15, 16, 24, 27, 28], "nameerror": 4, "nan": [0, 18], "nanxin": [], "naoki": 0, "narang": [0, 18], "narrow": 23, "nash": 0, "natalia": [], "nataniel": [], "nathan": [0, 8], "nathana\u00e3": [], "nation": 20, "nativ": [], "nattanat": [], "natur": [0, 2, 4, 5, 7, 8, 9, 13, 17, 18, 19, 22, 23, 24, 25, 27, 28], "navercorp": 15, "navig": 7, "navonil": [0, 5], "na\u00efv": 21, "nc": [5, 24], "ncl": [0, 17, 18], "ncsoft": 15, "nd": 5, "nearli": 11, "necessari": 26, "necessarili": [12, 19], "necessit": 20, "need": [0, 4, 8, 10, 12, 15, 16, 17, 19, 20, 21, 23, 24, 25, 26, 27], "neg": 26, "neil": [], "nessler": 0, "net": [0, 8, 11], "network": [0, 8, 9, 13, 14, 17, 18, 19, 21, 22, 28], "networkx": [], "neuraip": [], "neural": [0, 8, 9, 12, 13, 17, 18, 19, 21, 22, 28], "neurip": [0, 15], "neurocomput": [], "never": 28, "new": [0, 2, 4, 9, 11, 13, 14, 15, 17, 18, 19, 23, 24, 27, 28], "newcom": 3, "newer": [6, 8], "newman": [], "next": [4, 8, 9, 10, 11, 14, 15, 19, 21, 22], "nez": 0, 
"nezhurina": [0, 18], "nfeld": [], "nice": 11, "nichol": 0, "nichola": [0, 18], "nick": [], "nickson": 0, "nicola": [], "nie": [], "nielsen": 0, "nieto": [0, 28], "night": 24, "nikhil": [], "niki": 0, "nikita": [0, 8], "nikolau": [], "niru": [], "nistal": 0, "nlp": [3, 6], "nm": 24, "nmbkb24": [2, 18], "nmbkb24a": [0, 2, 11], "nmbkb24b": 0, "nn": [4, 24], "nniemi": [], "no_grad": [4, 16, 24], "noam": [0, 17, 18], "nois": 11, "noise2": [], "noisi": [11, 16, 28], "non": [6, 8, 11, 13, 20], "none": 4, "nonequilibrium": [], "nonsens": 19, "norberto": [], "norm": [11, 16], "normal": [10, 11, 24, 26, 27], "norman": [], "norouzi": [0, 17], "notabl": [2, 11, 12, 13, 23], "notat": 11, "note": [0, 2, 4, 5, 8, 10, 11, 13, 17, 21], "notebook": [4, 10, 15, 24], "nou": [], "nouri": [], "nov": [], "novack": [0, 5, 9, 15, 18], "novack2024prestod": [], "novel": [3, 16, 18, 27, 28], "novelti": [0, 18], "novemb": 15, "now": [2, 4, 8, 9, 10, 11, 15, 19, 20, 21, 24], "np": [16, 24], "npa": [0, 2], "nsynth": [13, 17], "nuanc": [9, 16, 25, 28], "null": [], "num_epoch": [4, 24], "num_return_sequ": 4, "num_work": 24, "numba": [], "number": [2, 6, 10, 11, 12, 21, 27], "numel": [4, 24], "numer": 11, "numpi": [10, 16, 24], "nvp": [], "nzc": [0, 11], "o": [10, 16], "o1": 19, "oasi": 27, "object": [6, 12, 14, 19, 28], "obtain": [5, 6, 8, 9, 12, 15, 19], "occupi": 7, "occur": 11, "octob": [0, 6, 8], "od": 0, "off": [9, 11], "offer": [11, 12, 15, 18, 25, 28], "often": [2, 5, 8, 9, 12, 16, 19, 20, 22, 23, 28], "oh": [], "oi": 0, "olaf": [], "older": 2, "olivi": [], "olv18": [0, 28], "omer": [], "ommer": [], "omran": [], "onc": [10, 11, 19], "one": [2, 5, 8, 9, 11, 14, 16, 17, 18, 19, 20, 21, 22, 26, 27, 28], "ones": [4, 21], "ongo": 20, "onli": [2, 4, 5, 6, 8, 9, 11, 15, 17, 19, 21, 25, 26, 27, 28], "onlin": [15, 24], "onto": 21, "ontologi": [], "ontrol": [], "oor": 0, "oord": [0, 17, 28], "oov": 27, "open": [0, 2, 6, 11, 19, 20, 23, 27, 28], "openai": [0, 13, 15, 18, 19], "openli": 19, "openmu": 
[0, 6], "openreview": [0, 8], "oper": [2, 3, 11, 23, 27], "opera": [4, 24], "operat": 4, "operatornam": 8, "opportun": [23, 25], "optim": [0, 2, 4, 18, 19, 22, 24, 28], "option": [8, 10, 19], "orama": [0, 28], "oran": [], "orchestr": 4, "orchestra": 4, "order": [4, 8, 9, 11, 21], "org": [0, 5, 6, 8, 9], "organ": [0, 8], "organis": 7, "orient": [], "origin": [2, 4, 6, 11, 12, 13, 19, 21, 24], "orio": [], "oriol": [0, 17, 28], "orjson": [], "orthogon": [16, 19], "oscar": [0, 17], "other": [0, 2, 4, 5, 7, 8, 9, 11, 12, 13, 16, 17, 18, 19, 20, 21, 22, 23, 28], "otherwis": 19, "our": [3, 8, 10, 11, 15, 19, 27], "out": [2, 3, 4, 8, 10, 11, 16, 18, 21, 24, 28], "outlier": 26, "output": [2, 4, 6, 8, 9, 10, 11, 12, 13, 17, 21, 22, 27, 28], "outsid": [15, 27], "ouyang": [0, 18], "over": [8, 11, 17, 21, 22, 25, 28], "overal": [2, 4, 6, 7, 8, 10, 11, 12, 19, 26], "overcom": 3, "overhead": 27, "overlap": 6, "overli": 21, "overview": [3, 8, 15, 22], "owj": [0, 18], "own": [4, 20], "p": [0, 4, 8, 9, 11, 17, 22, 24], "p310": [10, 15], "p_": [8, 11, 22], "p_0": 11, "p_1": 6, "p_i": [], "p_n": 6, "p_t": 11, "pablo": 0, "pachet": 0, "packag": [4, 10, 24], "pad": [4, 10, 24], "pad_token": 4, "pad_token_id": 4, "page": [3, 15], "pai": 11, "pair": [2, 4, 5, 7, 8, 12, 16, 19, 21, 28], "palett": [], "pamela": [0, 18, 28], "pan": [], "panda": 5, "pandei": [], "pandora": 15, "panel": [], "pann": [0, 12], "panorama": [], "paper": [15, 16, 19, 21], "paradigm": [4, 7, 8, 11, 13], "paragraph": 19, "parallel": [10, 15, 21], "param": [4, 24], "paramet": [4, 8, 10, 11, 19, 22, 24, 28], "parameter": 11, "pardo": 0, "pareto": 19, "pari": [], "parikh": 0, "park": [0, 18, 28], "parker": 0, "parma": [4, 24], "parmaet": 22, "parmar": 0, "pars": [6, 11], "parser": [], "parso": [], "part": [4, 7, 8, 11, 14, 17, 19, 21, 22, 24, 28], "partial": [6, 7, 10], "particip": [12, 15], "particular": [2, 4, 15], "particularli": [2, 5, 7, 8, 12, 13, 15, 17, 26, 27, 28], "partit": [], "partli": 21, "pasini": 0, 
"pass": [4, 8, 11], "passion": 15, "past": 21, "patashnik": [], "patch": 11, "patchifi": 11, "path": [5, 11], "pathak": [], "pathtool": [], "patrick": [0, 9], "pattern": [0, 5, 17, 19, 23, 26], "paul": [0, 17], "pauli": [], "pave": [13, 15], "pavlov": [], "payn": [0, 18], "pbar": [4, 24], "pcws22": [], "pd": 5, "peak": 10, "pedalboard": [], "peebl": 0, "peeter": [0, 9], "peizhao": [], "penalti": 6, "peng": [], "pengi": [0, 8], "peopl": [16, 23, 28], "per": [11, 19, 24, 28], "perceiv": [12, 28], "percept": 12, "perci": 0, "percuss": 15, "pereira": [0, 25], "perez": [], "perfect": [19, 24, 26], "perfecto": [0, 9], "perform": [0, 4, 8, 9, 11, 12, 13, 15, 16, 19, 24, 25, 26, 28], "perhap": [11, 23], "period": 2, "perplex": 16, "perraudin": [0, 8], "person": [], "personalis": 7, "perspect": [8, 12], "pertin": 18, "peter": [0, 18, 28], "pexpect": [], "pgpf23": [], "pgxh23": [], "ph": 15, "phase": [17, 18], "phbd03": [0, 9], "phd": [0, 15, 27], "phil": [], "philip": [0, 5, 23], "philipp": 0, "phillip": [], "philosophi": 8, "photorealist": [], "photoshopgenerativefil": [], "phrase": [16, 28], "physic": 15, "piano": [0, 15, 16, 28], "pianotre": 0, "pianotreeva": 13, "pick": [10, 19], "piec": [4, 9, 12, 17, 24, 26, 27, 28], "pierr": [], "pieter": 0, "pietquin": [], "pillow": [], "ping": [0, 6], "pink": 13, "pinkl": [0, 8], "pip": [4, 10, 15], "pipelin": 8, "pitch": [0, 13, 17], "pixel": 19, "pjbm22": [], "plai": [4, 10, 19], "plakal": [], "plan": 20, "platformdir": [], "platt": [], "play": 5, "playback": 7, "playground": [], "playlist": [0, 8, 9, 18, 25], "pleas": [3, 17], "plot": 19, "plotli": [], "plt": 16, "plugin": [], "plumblei": 0, "pmlr": [0, 17, 28], "point": [11, 12, 19, 26], "polici": 19, "polit": 20, "polosukhin": 0, "polyak": [0, 17], "polyffus": 0, "polyfuss": 13, "polyphon": 0, "pon": 0, "pooch": [], "pool": 0, "poor": 12, "pop": [0, 24, 27], "popular": [4, 7, 8, 19], "poria": [0, 5], "posit": [5, 7, 16, 21, 26, 28], "possess": 16, "possibl": [3, 7, 8, 9, 10, 
12, 13, 19, 21, 27, 28], "possibli": 5, "post": [11, 19], "post0": [], "post1": [], "post2": [], "posterior": [], "postolach": 0, "potenti": [8, 12, 15, 18, 20], "power": [0, 4, 5, 8, 16, 19, 24, 28], "pp27": [], "pp32": [], "pp33": [], "ppo": 19, "prabhudesai": [], "practic": [2, 3, 7, 8, 11, 15, 18, 19, 21, 22, 26, 28], "practition": [26, 28], "prafulla": [0, 18], "pre": [0, 4, 5, 6, 8, 9, 11, 16, 18, 19, 27], "preced": [14, 21, 28], "precis": [6, 8, 15, 19], "predefin": [6, 9, 26, 28], "predict": [0, 9, 11, 13, 14, 17, 19, 21, 26, 27, 28], "predominantli": 23, "preechakul": [], "prefer": [0, 15, 18, 19, 23, 25], "prefigur": [], "prefix": [0, 4, 8, 11, 14, 19, 22], "prefix_length": 4, "prefix_mask": 4, "prefix_project": 4, "prem": 0, "prepar": 3, "preprint": [0, 5, 6, 8, 9, 17, 18, 23, 25, 28], "preprocess": 14, "present": [7, 9, 11, 12, 15, 18, 23, 25, 28], "preserv": 4, "press": 0, "presto": [0, 15], "pretext": 19, "pretrain": [0, 9, 12, 14, 15, 16, 19, 24, 28], "pretti": 21, "prevent": 20, "previou": [8, 19, 21, 23, 25, 28], "previous": [11, 15, 25, 27], "primari": 9, "primarili": [13, 15, 17, 23, 28], "principl": [4, 13], "print": [4, 24], "prior": [2, 9], "pritch": [], "privat": 8, "pro": 24, "probabilist": 0, "probabl": [0, 8, 11, 13, 14, 19, 21, 22], "probe": 19, "problem": [16, 17, 18, 19, 20, 21, 23, 28], "problemat": 27, "proc": [0, 9], "proccess": 11, "proce": 11, "procedur": [], "proceed": [0, 8, 9, 11, 18, 25], "process": [0, 2, 4, 5, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 21, 22, 23, 25, 26, 27, 28], "prod_": [8, 9], "produc": [4, 5, 7, 8, 9, 19], "product": [0, 12, 15, 18, 19, 28], "program": [], "progress": [4, 5, 18, 19, 20], "progressbar": [], "proj": 11, "project": [4, 8, 11, 13, 21, 24, 28], "promin": [3, 24], "promis": [8, 19, 20, 25], "prompt": [0, 5, 10, 11, 18, 19], "prone": 19, "pronounc": 6, "propag": [0, 17], "propcach": [], "properti": [4, 11, 24, 28], "proport": [19, 26], "proportion": 27, "propos": [9, 28], "protobuf": [], 
"protocol": 6, "proven": 22, "provid": [3, 5, 6, 7, 9, 10, 11, 12, 15, 19, 20, 21, 22, 23, 26, 27, 28], "proxim": 19, "pschluter22": [], "psdv": [], "pseudo": [0, 8, 16, 18], "psk": [0, 2], "psutil": [], "psycholog": [], "pt": [4, 24], "ptyprocess": [], "public": 10, "publish": [0, 15, 23], "puckett": [0, 17], "puhrsch": [], "pull": 2, "pumarola": [], "pure": [11, 18, 19, 25, 28], "purpos": [7, 8, 12, 13, 21], "push": 28, "put": [2, 4, 21], "pw": [0, 18, 28], "px23": [0, 11], "py": [4, 10, 24], "py2": [], "py3": [], "pycpars": [], "pycr": [], "pydant": [], "pydantic_cor": [], "pydub": [], "pygment": [], "pyloudnorm": [], "pynndesc": [], "pypars": [], "pyplot": 16, "pystoi": [], "python": [4, 15, 24], "python3": [4, 10, 24], "python_multipart": [], "pythonhost": [], "pytorch": [4, 24], "pytorch_lightn": [], "pytz": [], "pyviz": [], "pyviz_comm": [], "pywavelet": [], "pyyaml": [], "pzh": [], "q": [8, 14, 26], "qa": [], "qi": [0, 9], "qian": [], "qiao": 0, "qifeng": [], "qing": [], "qingq": [0, 5, 18, 28], "qiuqiang": 0, "qiyang": [], "quad": 11, "quadrat": 11, "qualit": 23, "qualiti": [0, 6, 13, 17, 18, 19, 26], "quantiz": [11, 14, 19], "queen": 15, "queri": [0, 5, 7, 8, 9, 14, 16, 18, 19, 24, 25, 26, 27], "query_vector": 24, "question": [0, 3, 5, 6, 8, 11, 19, 22], "quick": [10, 19], "quickli": [4, 19, 26], "quinton": [0, 6, 8, 9, 15, 18, 28], "quit": [2, 11], "qun": [], "quoc": [0, 18], "quot": 2, "qwen": 8, "r": [0, 11, 15, 28], "r_1": 6, "rachel": [0, 8, 9, 18], "racial": 20, "radford": [0, 4, 18, 28], "radio": 28, "radiohead": 28, "radlinski": [0, 18, 25], "raffel": [0, 18], "rag": 20, "rai": [0, 18], "ram": 0, "ramalingam": [], "ramesh": [0, 28], "ramsauer": 0, "randn": 10, "random": [4, 24], "random_st": 16, "randomnam": [], "rang": [2, 4, 7, 8, 17, 18, 19, 22, 23, 24, 26, 27, 28], "rank": [19, 26, 27], "rank_q": 26, "rao": [], "rap": 16, "rapha": [], "raphael": [], "rapid": 15, "rapidli": 13, "raquel": [], "rare": [2, 21, 26, 28], "rashindra": [], "rate": [4, 
10, 11, 12, 19, 24], "rather": [4, 8, 11, 12, 16, 17, 23, 27, 28], "rave": [0, 2], "ravi": [0, 18, 25, 28], "raw": [0, 11, 17, 24, 27, 28], "raymond": [0, 9], "rbl": [], "rdn": [], "re": [0, 3, 4, 7, 10, 11, 17, 22, 23, 24], "reach": 3, "reaction": 23, "readabl": 7, "reader": 11, "readi": 24, "real": [0, 2, 9, 12, 13, 15, 17, 19, 20, 23, 26, 27, 28], "realis": 8, "realist": [], "realiti": 16, "realiz": 19, "realli": [11, 19, 21, 22], "realm": [4, 24], "reaon": 12, "rearrang": [10, 21], "reason": [2, 9, 11, 20], "recal": 6, "receiv": [8, 19], "recent": [4, 7, 8, 9, 10, 13, 15, 16, 19, 20, 21, 22, 23, 24, 25, 28], "reciproc": 26, "recogn": [12, 19, 27], "recognis": [], "recognit": [0, 5, 9, 17, 27], "recommend": [0, 4, 5, 7, 10, 12, 17, 23, 25], "reconstruct": 14, "record": [], "recurr": [0, 9, 13, 22], "red": 21, "reduc": [6, 19, 20, 28], "refer": [0, 2, 11, 15, 19], "referenc": [], "refin": [23, 25], "reflect": [9, 12, 13, 19, 26, 28], "refram": 8, "regard": [7, 19, 27], "regardless": 28, "regener": 2, "regex": [], "regina": [], "region": [2, 11], "regress": [14, 18, 19], "regular": [11, 26], "reinforc": 19, "reinvent": 11, "reiss": [], "rel": [9, 16, 17], "relat": [15, 16, 19, 20, 23, 27, 28], "relationship": [6, 9, 17, 21, 25, 28], "relax": 5, "releas": 21, "relev": [11, 15, 18, 19, 24, 25, 26, 27, 28], "reli": [6, 8, 12, 13], "reliabl": [6, 26], "relianc": 28, "religi": 20, "relu": [4, 24], "remain": [8, 11, 16, 20, 23, 25, 28], "remark": 15, "remedi": 19, "remez": [0, 18], "remi": 13, "remind": 28, "remov": 11, "ren": [], "render": 10, "renum": 5, "renumics___song": [], "rep": [], "repeat": 5, "repeatedli": 23, "repetition_penalti": 4, "replac": [11, 19], "repo": [], "report": [0, 16, 18], "repositori": 15, "repres": [7, 12, 13, 16, 17, 19, 21, 25, 28], "represent": [0, 4, 6, 8, 12, 15, 17, 19, 28], "repurpos": 2, "request": [16, 23], "requir": [2, 7, 8, 11, 12, 15, 19, 20, 23, 25, 26, 27, 28], "requires_grad": [4, 24], "requires_grad_": 4, "rer": [0, 13], 
"resampi": [], "research": [0, 3, 9, 12, 13, 15, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28], "reshap": 4, "residu": [11, 14], "resize_token_embed": 4, "resnick": [0, 17], "reso": [], "resolut": [2, 11, 14], "resort": 21, "resourc": [7, 10, 19], "respect": [7, 8, 9, 28], "respons": [0, 5, 8, 9, 12, 16, 19, 22, 23, 25], "respos": 8, "rest": [10, 11], "restrict": [8, 26, 27, 28], "result": [6, 8, 11, 12, 17, 19, 23, 26, 27, 28], "retain": 8, "rethink": [], "retriev": [0, 3, 6, 7, 8, 9, 13, 15, 16, 22, 26, 28], "retrieval_fn": 24, "return": [4, 5, 24, 26, 27], "return_tensor": [4, 24], "reus": 19, "reveal": 23, "revers": [0, 11], "review": [0, 3, 5, 6, 7, 8, 9, 11, 16, 18, 23, 28], "revisit": 23, "reward": 19, "rewon": [0, 18], "rez": 0, "rfb15": [], "rfer": [], "rg": [], "rgy": [0, 8, 9, 18, 28], "rhythm": [0, 2, 5], "rhythmic": 2, "ricardo": [], "riccardo": [], "rich": [16, 28], "richard": [], "richardson": [0, 9], "richer": [17, 19, 23, 28], "rif": [0, 13], "riff": [0, 2, 5, 24], "riffus": [0, 13], "rigel": [], "right": [5, 8, 11, 20], "rightarrow": [9, 11], "rinon": [], "rise": 17, "risk": 20, "rita": [0, 8], "rithesh": [0, 17], "ritter": [], "rkh": [0, 16, 28], "rkx": [], "rlhf": 19, "rlj": [], "rm": 19, "rmh": [], "rn": [], "rnn": [0, 8, 13, 21, 22], "rob92": [], "robbin": [], "robert": [0, 5, 17, 18], "roberta": [0, 21, 24, 28], "roberta_emb": 24, "robin": [], "roblek": 0, "robust": [25, 26, 28], "robustli": [0, 24, 28], "rock": [0, 2, 4, 5, 16, 17, 18, 24, 27, 28], "rod": [], "rodol": 0, "roform": [], "roger": [0, 8], "role": 18, "roll": 5, "romain": [0, 8, 9], "rombach": [], "ron": 0, "rongchen": [0, 5, 8, 9], "rongji": [], "ronneberg": [], "room": 2, "root": [4, 24], "roshan": [], "rot92": [], "rotari": [], "rothstein": [], "roug": 6, "rouge_l": [], "rough": 11, "round": 10, "rout": 8, "roux": 0, "rovan1997igm": 0, "row": 21, "rpd": [], "rpg": [], "rsr": [0, 18], "rubinstein": [], "ruff": [], "rui": [], "ruihan": 0, "ruiz": [], "rule": 0, "run": [2, 10, 11, 15], 
"runner": [], "runtim": [4, 24], "runtimeerror": [], "russel": [0, 5], "rvq": 14, "rvqgan": 0, "rwc": [0, 18], "rwd97": [0, 13], "rxl": [], "s3f": [], "s4": 13, "s41592": [], "s_": 11, "sa": 5, "sabet": [], "sabour": [], "sadeep": [], "safehttpx": [], "safer": 20, "safetensor": [], "sageev": 0, "saharia": [], "sai": [8, 10, 21, 23], "sain": 0, "saito": [], "sakkeer": [0, 5, 8], "sal": [], "salamon": [0, 5, 28], "salient": [2, 8, 15], "saliman": 0, "salmonn": [0, 8], "sam": [0, 18], "same": [8, 11, 12, 16, 19, 20, 21, 28], "sameer": 0, "sami": [], "sampl": [2, 4, 10, 11, 12, 19], "sample_r": 10, "sample_s": 10, "sampler": 10, "sampler_typ": 10, "samplernn": [0, 13, 17], "samuli": [], "san": 15, "sanakoyeu": [], "sander": [0, 11, 17], "sandhini": [0, 18, 28], "sandler": [0, 8, 9], "sang": [], "sanja": [], "sanjiv": [], "saroufim": [], "sashimi": 13, "sastri": [0, 28], "satisfact": [16, 26], "satisfi": [23, 26], "sauer": [], "saurabh": [], "saurou": [], "save": [10, 19], "savitzki": [], "saw": 19, "saxena": [], "saxophon": 28, "sbd": [], "sbr22": [], "sc": [], "scalabl": 0, "scale": [0, 8, 9, 10, 11, 12, 18, 28], "scatter": 16, "scc": [], "scdbk24": [0, 8], "scenario": [15, 23, 26, 28], "scene": [6, 11], "sch": [], "schedul": [], "schelten": [], "scheme": [8, 21], "schl": [], "schmidt": [0, 9], "schneider": [], "schoenfeld": [], "schulman": [], "schuster": [], "scienc": [0, 15, 23], "scientif": [], "scikit": [], "scikit_imag": [], "scikit_learn": [], "scipi": [], "scope": [16, 26, 28], "score": [0, 6, 8, 11, 19, 23, 26, 27], "scoroda18": [], "scott": [0, 9], "scratch": [2, 19, 23], "sd": 11, "sdcs23": [], "sdd": 5, "sde": [10, 11], "sdk": [], "sdwmg15": [], "search": [7, 15, 17, 18, 19, 23, 24, 25, 27, 28], "seb": [], "sebastian": [], "sec": [], "second": [0, 4, 8, 10, 11, 12, 17, 19, 21, 24, 27], "seconds_start": 10, "seconds_tot": 10, "secret": 10, "section": [4, 8, 9, 13, 14, 16, 19, 21, 22, 28], "see": [4, 6, 7, 8, 9, 11, 19, 21, 22], "seed": 10, "seek": [0, 2, 9, 
11, 15, 23], "seem": 11, "seen": [2, 4, 8, 16, 19, 28], "seetharaman": 0, "segment": [9, 21, 22], "select": [12, 28], "self": [0, 4, 8, 10, 11, 15, 16, 19, 24, 28], "semant": [0, 6, 16, 17, 18, 19, 27, 28], "semantic_vers": [], "semanticscholar": 0, "semi": [0, 9], "senior": [0, 17], "sens": [11, 19, 21, 22, 24], "sensit": [12, 26], "sentenc": [6, 8, 9, 13, 16, 18, 21], "sentence_transform": 16, "sentencepiec": 21, "sentencetransform": 16, "sentri": [], "sentry_sdk": [], "seong": [], "separ": [0, 8, 11, 15, 19, 21, 27], "sepp": 0, "sequenc": [0, 8, 9, 10, 11, 19, 22, 28], "sequenti": [0, 4, 8, 24], "sergei": [], "sergio": [0, 28], "seri": [8, 11, 15], "serra": [0, 9, 28], "serv": [12, 14, 15, 16, 23, 25], "server": 15, "session": [0, 4, 8, 12], "set": [0, 5, 6, 8, 9, 10, 12, 18, 22, 25, 26, 27], "seth": 0, "setproctitl": [], "setup": 12, "setuptool": [], "seungheon": [0, 3, 5, 8, 15, 18, 23, 25, 28], "seungheond": 10, "seungjun": [], "seventh": [0, 8], "sever": [7, 8, 9, 12, 19, 23, 25, 26, 28], "sexual": 20, "seybold": [], "seyedhosseini": 0, "sfg": [], "sfjb21": [], "sfk24": [], "sft": 19, "sg64": [], "sgz": [0, 12], "sh22": [], "shan": [0, 5, 8], "shang": [], "shansong": [0, 5, 8], "shaohan": [0, 28], "shaoq": [], "shaoshu": [], "shaozh": [], "shape": [4, 24, 28], "sharan": [0, 18], "share": [8, 17, 28], "sharifi": 0, "shawn": [], "shayn": [0, 18], "shazeer": [0, 18], "she": [4, 15], "shechtman": [], "sheld": 11, "shelf": 11, "shellingham": [], "shen": [], "sheng": [], "shengfeng": [], "sherlock": [], "shi": [], "shibuya": [], "shift": [7, 9, 11, 17, 21, 28], "shih": [0, 18], "shihao": [], "shiliang": [], "shinji": [0, 18], "ship": 19, "shiqi": [0, 6], "shiran": [], "shivam": [], "shjl24": [], "shkk22": [], "shlomo": [0, 8, 9, 18, 28], "short": [0, 4, 9, 17], "shortcom": 7, "shortli": 8, "shot": [0, 9, 18, 28], "should": [4, 9, 10, 11, 15, 19, 25, 26], "show": [16, 19, 26, 27, 28], "shown": [5, 8, 13, 19, 20, 21, 28], "shrirao": [], "shu": [0, 18, 25], "shuai": 
[0, 28], "shubham": [0, 17], "shuffl": [4, 24], "shunt": [], "shuqi": [], "shusuk": [0, 6], "shyamal": [0, 18], "si": [], "siang": 0, "sicong": [], "siddhartha": [0, 18], "side": 2, "siggraph": [], "sigir": [0, 18, 25], "sigma_max": 10, "sigma_min": 10, "signal": [0, 4, 5, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 28], "signatur": 13, "signifi": 15, "signific": [15, 23, 25, 27, 28], "significantli": [19, 28], "sil": [], "silenc": 10, "sim": 11, "simian": [], "similar": [0, 2, 6, 8, 11, 12, 17, 19, 21, 23, 24, 25, 26, 27], "similarity_metr": 24, "similarli": [6, 8, 28], "simon": [0, 8, 9, 18], "simonetta": [], "simonyan": [0, 17], "simpl": [0, 2, 4, 9, 13, 18, 19, 21, 25, 28], "simple_contrastive_loss": 24, "simpler": [10, 11, 21], "simplest": [2, 8, 9], "simpli": [2, 4, 8, 10, 11, 19, 22], "simplifi": [], "simultan": [0, 13, 16, 28], "sinc": [8, 11, 12, 13, 19, 21, 27], "sing": [0, 4, 15, 24], "singer": [0, 4, 5, 24], "singh": [0, 8], "singl": [0, 8, 9, 10, 11, 12, 16, 17, 18, 26, 28], "singsong": [0, 2], "siraichi": [], "site": 10, "situat": [12, 19, 22], "sivic": [0, 5], "six": [], "siyu": [], "siyuan": [], "size": [4, 5, 8, 9, 11, 19, 21, 24, 28], "sjscholkopf23": [], "sk": [], "skals": [], "sketch": 8, "sketchnet": [0, 13], "skip": 19, "skip_special_token": 4, "sklearn": 16, "skoglund": [], "skyrocket": 4, "slama": [0, 18], "slbr23": [], "slc07": [0, 17], "slightli": [8, 21, 23], "slow": [19, 24], "slowdown": 11, "small": [4, 9, 11, 12, 19, 21, 26], "smaller": [19, 28], "sme20": [], "smith": [], "smitin": [0, 2], "smmap": [], "smollei": [], "smooth": 28, "snake": 11, "sne": 16, "sniffio": [], "so": [2, 4, 6, 8, 10, 11, 15, 19, 21, 22], "so17": [0, 13], "soar": 4, "social": [0, 16, 17], "societi": [0, 6, 8, 9, 15, 18, 20, 23], "soft": [5, 16, 28], "softmax": [8, 27], "softwar": 15, "soham": [0, 8], "sohl": 0, "sojoudi": [], "solid": 15, "solo": [4, 5, 28], "solut": [3, 18, 19], "solv": [11, 16, 19, 20, 21], "solver": 11, "somayeh": [], "some": [2, 4, 5, 7, 8, 9, 10, 
11, 13, 17, 19, 21, 22], "someth": [11, 19], "sometim": [8, 9, 19], "son": [], "song": [0, 2, 4, 5, 9, 10, 23, 24, 27, 28], "soni": [13, 15], "soon": 13, "sophist": [3, 4, 8, 25, 28], "soprano": 4, "sora": 19, "sordo": [0, 17], "soroush": [0, 17], "sort": [2, 11, 26], "sot": 21, "sotelo": [0, 17], "soujanya": [0, 5], "soumith": [], "sound": [0, 4, 5, 8, 9, 11, 16, 17, 18, 24, 27, 28], "soundctm": [], "soundfil": [], "soundstorm": [], "soundstream": [], "sourc": [0, 5, 15, 17, 19], "sourcetensor": 4, "sourish": [], "southern": 15, "souza": [], "space": [0, 3, 4, 8, 11, 13, 16, 19, 23, 28], "spam": 20, "span": [7, 15, 18], "spanish": 24, "speak": 11, "special": [16, 19, 21, 25, 28], "specialis": 8, "specif": [2, 3, 4, 8, 9, 12, 13, 15, 17, 18, 19, 21, 22, 27, 28], "specifi": 23, "speck": [0, 9], "spectral": 11, "spectrogram": [0, 11, 14, 21, 28], "spectrum": [13, 23], "speech": [0, 5, 8, 9, 12, 13, 17, 18, 19, 27, 28], "spend": 19, "spice": 6, "spider": 6, "spijkervet": 0, "spirit": 11, "spl": 0, "split": [4, 24], "spm": 15, "spotifi": 15, "springer": [0, 28], "sqrt": 8, "squar": 21, "sr": 10, "src": 5, "srikumar": [], "srivatsan": [0, 8], "ssdk": [0, 11], "ssw": 0, "stabil": [10, 15], "stabilityai": 10, "stabl": [0, 2, 11, 16], "stable_audio_tool": 10, "stableaudio": 13, "stack": 11, "staff": 15, "stage": [4, 8, 19], "standard": [2, 4, 6, 8, 11, 21, 22, 26], "stanlei": [], "starlett": [], "start": [8, 9, 10, 11, 13, 17, 21, 22, 23], "startup": 13, "stasyuk": [], "state": [0, 2, 8, 9, 13], "static": [6, 9, 15, 19, 28], "statist": 12, "steadi": 20, "steer": 23, "steerabl": 0, "stefan": 0, "stefano": 0, "steinmetz": [], "stem": [2, 28], "stemgen": [0, 2], "step": [0, 6, 8, 10, 11, 12, 14, 19, 20, 22, 26], "stephen": [0, 17], "stereo": 0, "steven": [0, 5, 8, 9, 18], "stft": [11, 14], "still": [8, 12, 16, 19, 20, 21, 23], "stimulu": 12, "stitch": [], "stochast": [0, 11], "stoi": [], "stoller": [0, 8, 9, 18], "stop": 10, "store": [], "stori": 24, "straight": 19, 
"straightforward": [12, 21], "strategi": [0, 6, 18, 28], "strength": [11, 12], "strictli": 10, "string": [4, 11, 24], "strong": [2, 20, 24], "stronger": 28, "strongli": [2, 6], "strub": [], "struct": 7, "structur": [0, 2, 9, 13, 19, 23, 27, 28], "struggl": [19, 23, 28], "strum": 24, "student": [15, 19], "studi": [3, 9, 11, 15, 17, 23, 27, 28], "style": [2, 9, 11, 13, 17, 23, 28], "su": [], "sub": [9, 21, 22], "subject": [0, 5, 6, 12], "sublinear": [], "submit": 19, "subscript": 22, "subsequ": [6, 13, 23], "subset": [4, 21, 24, 27], "substanti": [15, 28], "substr": 22, "subtl": [12, 28], "subword": [21, 28], "succeed": 3, "succes": 11, "success": [8, 21, 22, 25], "suffici": [19, 25, 27, 28], "suggest": [4, 8, 10, 23, 25], "suha": [], "suhail": [], "suitabl": [5, 8, 9, 12], "suk": [], "sum": [4, 24], "sum_": [8, 26, 28], "sumbali": [], "summar": [12, 19], "summari": [7, 9], "summaris": 6, "summit": 20, "sun": [0, 5, 8, 13], "sungroh": [], "sunni": [5, 28], "suno": [0, 13], "suo": [], "supasorn": [], "super": [4, 24], "superior": 12, "supervis": [0, 8, 9, 15, 17, 18, 19, 21, 27, 28], "supplement": [5, 15, 24], "suppli": 21, "support": [7, 10, 17, 20, 23, 24, 25], "suppos": 21, "sure": [4, 10, 24], "surround": [19, 21, 22, 28], "survei": [0, 15, 17], "surya": [], "sustain": 25, "sutskev": [0, 18], "suttisak": [], "suwajanakorn": [], "svn37": [], "swap": [0, 28], "swave": [], "sweep": 10, "sweet": 3, "swerski": [], "swiss": [0, 6], "swy": [], "sylvain": [], "symbol": [0, 13], "sympi": [], "synchron": [0, 18], "synnaev": [0, 18], "syntact": 6, "synth": 5, "synthes": 13, "synthesi": [0, 13, 17], "synthet": [0, 7, 25], "system": [0, 2, 4, 6, 7, 8, 9, 10, 12, 16, 17, 18, 19, 23, 24, 26, 27, 28], "szk": [], "szu": [0, 17, 18], "t": [0, 2, 3, 5, 6, 8, 9, 10, 11, 16, 18, 19, 21, 23, 24, 27, 28], "t1": 10, "t5": [11, 14, 16, 19, 21, 22], "t_i": 28, "t_j": 28, "tabl": [8, 19, 21], "tackl": [8, 28], "taehong": [], "taesu": [0, 23, 25], "taesung": [], "tag": [0, 2, 4, 5, 9, 16, 
17, 18, 24, 27], "tagliasacchi": [0, 5, 18], "tai": [0, 18], "taigman": [0, 17], "tak": 8, "takahashi": [0, 6], "takashi": [], "take": [2, 4, 6, 7, 8, 11, 14, 19, 20, 27, 28], "takida": 0, "tal": [0, 18], "talent": [4, 24], "tali": [], "talk": [0, 2, 7, 19, 25], "tallini": 0, "tan": [0, 8], "tang": [0, 6, 8], "tanh": 8, "tao": [], "tar": [], "target": [8, 11, 14, 27], "task": [0, 3, 4, 6, 7, 8, 11, 12, 13, 14, 15, 18, 20, 21, 26, 27], "taslp": 15, "tat": [], "tau": 28, "taylor": [0, 8, 15, 18, 28], "tb": 10, "tb_name": 10, "tbtl08": [0, 17, 18, 27], "tc02": [0, 9], "te_dataload": [4, 24], "tea": 5, "teach": [0, 15, 17, 18], "teacher": 19, "teboul": [], "tech": [], "technic": [0, 15, 18, 22, 23], "techniqu": [0, 8, 12, 15, 18, 21], "technologi": [0, 15, 17, 18, 23, 27], "teh": 2, "tejasvi": [0, 18, 25], "telecommun": [0, 12], "tell": 24, "temperatur": [4, 24], "templat": 5, "tempo": [5, 16, 17, 24, 27], "tempor": [0, 2, 4, 5, 9, 11], "ten": 19, "tenac": [], "tencent": 15, "tend": [2, 20, 23], "tendenc": 26, "tenenbaum": [], "tensor": [4, 24], "tensor_numpi": [], "tensorboard": [], "tensorboard_data_serv": [], "teoh": [], "ter": [], "term": [0, 2, 6, 7, 8, 9, 11, 12, 16, 17, 18, 19, 27, 28], "termcolor": [], "tero": [], "test": [4, 19, 20, 24], "test_dataset": [4, 24], "tester": 12, "teuwen": [], "text": [0, 3, 4, 6, 7, 8, 9, 10, 13, 14, 15, 16, 18, 19, 22, 23, 24, 26, 27], "text2song": 2, "text_embedding_dim": [4, 24], "text_encod": 24, "text_forward": 24, "text_model": 4, "text_output": 24, "text_project": 24, "text_token": [4, 24], "textrm": [11, 22], "textsubscript": 6, "textual": [0, 2, 12, 13, 18, 28], "textur": 23, "textwrap": 4, "tf": 6, "th20": [], "thabet": [], "thabo": [], "than": [2, 4, 8, 9, 11, 12, 15, 16, 17, 19, 20, 23, 27, 28], "thang": [], "thank": [16, 28], "thdl24": [0, 13], "thei": [4, 6, 7, 8, 9, 11, 15, 16, 19, 21, 22, 23, 25, 28], "them": [2, 4, 6, 8, 10, 11, 12, 13, 19, 21, 28], "theme": [17, 28], "theori": 8, "therebi": 15, "therefor": [8, 
9, 12], "thereof": 8, "theres": 11, "thermodynam": [], "thesi": [0, 15, 27], "theta": [8, 11, 22], "thi": [2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28], "thibault": [], "thickstun": 0, "thierri": [0, 17], "thing": [2, 4, 10, 11, 20, 24], "think": [4, 9, 11, 19, 28], "third": [0, 8], "thirti": [0, 8], "thoma": 0, "those": [9, 19, 20, 21, 22], "though": [8, 9], "thread": 8, "threadpoolctl": [], "three": [5, 12, 13, 19, 21, 23], "threshold": 26, "through": [0, 2, 3, 4, 5, 7, 8, 9, 10, 13, 15, 16, 17, 18, 19, 20, 23, 24, 25, 27, 28], "throughout": 5, "throughput": 20, "thu": [11, 15], "ti": 8, "tian": [0, 8], "tianqi": [], "tianwei": [], "tianyu": [0, 28], "tie": [0, 9], "tier": 15, "tifffil": [], "tight_layout": 16, "tillet": [], "tim": 0, "timbr": [2, 16], "timbretron": [], "time": [0, 2, 8, 9, 10, 11, 12, 13, 14, 17, 18, 19, 20, 21, 25, 27, 28], "timeless": 4, "timelin": 13, "timescal": [8, 9], "timestep": 11, "timo": [0, 5, 18], "ting": [0, 17], "tinghui": [], "tip": 24, "titl": 16, "tl89": [0, 13], "tn": 26, "to_html": 5, "todai": [7, 8, 24], "todd": 0, "todo": [], "togeth": 28, "token": [0, 3, 4, 6, 8, 9, 10, 11, 14, 19, 20, 28], "tokenization_utils_bas": [4, 24], "tokenizers_parallel": 10, "tom": 0, "tomer": 0, "tomlkit": [], "tommi": [], "too": [4, 19, 22, 24, 26], "tool": [4, 10, 13, 18, 25], "toolkit": [], "top": [8, 15, 19, 21, 26, 28], "top_k": 4, "top_p": 4, "topic": [3, 9, 13, 15, 19, 23], "topk": 24, "torch": [4, 10, 16, 24], "torch_stoi": [], "torchaudio": [4, 10, 24], "torchdiffeq": [], "torchlibrosa": [], "torchmetr": [], "torchsd": 10, "torchvis": 10, "tornado": [], "torr": [0, 17, 18, 27], "toshimitsu": 0, "total": 10, "total_loss": [4, 24], "toutanova": [0, 18, 28], "tov": [], "tovstogan": [0, 5, 23], "toward": [0, 5, 7, 8, 9, 11, 17, 18, 20, 23], "tp": 26, "tqdm": [4, 24], "tr": [], "tr_dataload": [4, 24], "trace": [3, 7, 18], "traceback": [4, 10, 16, 24], "track": [0, 5, 8, 9, 17, 27, 28], 
"track2emb": [4, 24], "track_id": [4, 24], "tradeoff": 20, "tradit": [3, 13, 25, 27], "tradition": 9, "train": [0, 2, 3, 5, 6, 7, 9, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 27], "train_dataset": [4, 24], "train_loss": [4, 24], "train_parma": [4, 24], "traitlet": [], "trajectori": [0, 23], "trampolin": [], "tran": 0, "transact": [0, 9, 17, 18, 27], "transcript": 15, "transfer": [0, 2, 13, 18, 21, 28], "transform": [0, 4, 5, 8, 9, 11, 13, 14, 16, 18, 19, 21, 22, 24, 28], "transit": 18, "translat": [0, 4, 6, 7, 8, 11, 17, 19], "transpar": [], "transport": 4, "treat": [3, 8, 11, 21, 23, 27], "tremend": 21, "trend": [8, 13], "tri": [], "trick": 2, "trigger": [], "triplet": [3, 28], "true": [4, 10, 16, 24, 26], "truli": [2, 4], "truncat": [4, 24], "trung": [], "trust": 24, "truth": [4, 6, 24, 26], "try": [4, 16, 22, 24], "tsa": [0, 9], "tsai": [], "tsne": 16, "tsung": [0, 9], "ttm": [2, 11], "ttmr": 16, "tu": [], "tune": [4, 19, 24], "tuoma": [], "tupl": 9, "turab": 0, "turbo": [], "turn": [9, 10, 11, 18, 21], "turnbul": [0, 9, 17, 18, 27], "tutoir": [], "tutori": [2, 4, 9, 11, 13, 15, 16, 17, 19, 21, 22, 24], "twelfth": [0, 8], "two": [0, 3, 7, 8, 12, 13, 14, 17, 21, 27, 28], "txt": 15, "ty": [0, 8], "type": [4, 5, 7, 8, 12, 13, 21], "typer": [], "typic": [6, 7, 8, 9, 12, 19, 23, 27, 28], "tzanetaki": [0, 9], "tzdata": [], "tzg": [0, 2], "u": [3, 4, 11, 19, 20, 22, 28], "uc": [], "ucsd": [], "udi": [0, 13], "udio": [0, 13], "uesaka": 0, "ugen": [], "uh": [], "ultim": 12, "umap": [], "umap_learn": [], "umbrella": [7, 8], "umg": 15, "un": 21, "unannot": 27, "unattribut": 2, "uncommon": 21, "uncondit": [0, 17], "under": [8, 15, 27], "underbrac": 22, "undergo": 8, "understand": [0, 4, 5, 6, 8, 9, 11, 12, 13, 15, 16, 17, 18, 22, 23, 25, 26, 28], "understnad": 24, "understood": 20, "unequivoc": 8, "unfamiliar": 28, "unfeas": 27, "unforgett": 21, "unfortun": 23, "uni": [], "uni01": [0, 12], "unifi": [0, 8, 16, 18], "unigram": 6, "union": 0, "uniqu": [3, 12, 20], 
"unit": [0, 6, 8, 9, 28], "univ": [], "univers": [0, 15, 17, 18, 28], "unknown": [21, 28], "unlabel": [7, 21], "unlik": [11, 12, 19, 21, 25, 28], "unlimit": 28, "unrel": 28, "unresolv": 15, "unrestrict": 27, "unrol": 8, "unsatisfactori": 23, "unseen": [17, 21], "unsqueez": 4, "unstabl": 2, "unsupervis": [0, 4, 18], "unterthin": 0, "until": [11, 22], "unus": 2, "up": [2, 6, 10, 11, 19, 20, 21], "upbeat": [2, 5, 23, 24, 25, 27, 28], "updat": [0, 4, 19], "upend": 2, "uplift": 5, "upload": 10, "upon": 23, "uriel": 0, "url": [0, 5, 6, 8, 9], "urllib3": [], "urtasun": [], "us": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28], "usa": 15, "usabl": 16, "usag": [17, 20, 26], "usai": [], "user": [0, 8, 9, 10, 15, 16, 18, 19, 23, 24, 25, 26, 27, 28], "userwarn": [4, 10], "usic": [], "usr": [4, 24], "usual": [6, 7, 8, 9, 22, 27], "uszkoreit": 0, "utf": 5, "util": [4, 14, 24, 28], "uvicorn": [], "v": [0, 8, 9, 11, 14, 18], "v1": [0, 8, 9], "v2": [], "v3": [], "v4": [4, 24], "v_diffusion_pytorch": [], "va": [], "vae": [0, 11, 19], "vahdat": [], "vajda": [], "valid": [4, 24], "valid_loss": [4, 24], "valu": [6, 11, 12, 14, 19, 26, 27], "valuabl": [12, 23, 25], "vampnet": [0, 2, 13], "van": [0, 17, 28], "vandergheynst": [], "vari": [0, 2, 8, 11, 18], "variabl": [9, 10, 11], "varianc": 12, "variant": [6, 7, 9], "variat": [0, 8, 9, 19, 23], "varieti": [7, 8, 16, 19, 21, 23], "variou": [5, 15, 16, 18, 19, 27, 28], "varun": [], "vascipy10contributors20": [], "vast": 28, "vastli": 21, "vasudevan": 0, "vaswani": 0, "vdodz": [0, 13, 17], "vdov": [], "ve": [2, 3, 4, 8, 24], "vector": [0, 2, 8, 11, 14, 18, 19, 24, 28], "vector_quantize_pytorch": [], "veit": [], "ventur": 2, "venv": [], "veri": [10, 19, 21, 27], "vers": 2, "versatil": [], "version": [4, 11, 12, 28], "versu": 28, "verzetti": [0, 5, 18], "vesa": [], "vggish": 12, "via": [0, 5, 6, 8, 14], "vibe": [5, 24], "vicki": 0, "vicol": [], "video": [0, 5, 7], "videocrafter1": [], "view": 
[19, 27], "vijai": 0, "vincent": [0, 18], "vinh": [], "vinyal": [0, 17, 28], "violin": [4, 24], "virtanen": [], "virtual": [10, 19, 28], "visheratin": [], "visio": 0, "vision": [0, 5, 6, 16], "visit": [0, 5, 6, 8, 9], "visual": [0, 16, 28], "vocabulari": [9, 16, 17, 18, 21, 28], "vocal": [2, 4, 5, 24, 28], "vocalist": [4, 24], "vocod": [0, 11], "voic": [2, 4, 24], "volkmann": [], "volum": [0, 2, 5, 8, 9, 19], "voss": [], "voznesenski": [], "vq": 19, "vqgan": 19, "vri": [], "vsp": [0, 14], "vulner": 24, "w": [0, 8, 11, 16], "wa": [4, 13, 15, 17, 19, 21, 24, 28], "wade": [], "wai": [2, 3, 5, 6, 8, 11, 13, 15, 16, 18, 19, 21, 23, 24, 27], "wainwright": [0, 18], "wakaki": [0, 6], "walk": [0, 11, 25], "wallac": [], "wandb": [], "wang": [0, 6, 8, 18], "wang_self": [], "wanmo": [], "want": [2, 4, 5, 10, 17, 19, 21, 22, 23, 24, 27], "warn": [4, 10, 16, 24], "watanab": [0, 18], "watson": [], "wattenhof": [0, 8], "wav": 5, "wav2vec2featureextractor": 4, "waveform": [11, 17, 28], "wavegan": 17, "wavenet": [0, 13, 17, 19], "wbz": [0, 18], "wcmb": [], "wcs21": [0, 9], "wcwidth": [], "wcy22": [], "wcz": [0, 12, 28], "wdwb23": [0, 2, 11, 18], "wdwb24": [], "we": [2, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 27, 28], "weak": [0, 12], "weaker": 16, "web": [0, 15, 19, 28], "webdataset": [], "webencod": [], "websocket": [], "weck": [0, 5, 6, 23, 28], "weer": 0, "wei": [0, 6, 8, 9, 18, 28], "wei_finetuned_2021": [], "weigh": [8, 28], "weight": [4, 6, 8], "weihao": [], "weikang": [], "weili": [], "weiner": 11, "weird": 11, "weiss": [], "weituo": [0, 9], "welcom": [15, 24], "well": [2, 4, 6, 8, 11, 12, 15, 16, 18, 19, 20, 22], "wen": [0, 5, 9], "weng": [], "wenhao": [0, 5, 8, 9], "wenhu": [0, 5, 8, 9], "wenwu": 0, "wenyi": [0, 8], "were": [3, 6, 17, 19, 21, 22, 23, 27, 28], "werkzeug": [], "wgen23": [], "wget": [], "wgn23": [], "whang": [], "what": [8, 9, 11, 19, 21, 23, 27], "whb": [], "when": [3, 6, 7, 8, 9, 12, 14, 18, 19, 21, 22, 23, 26, 27, 28], 
"whenev": 17, "where": [2, 4, 8, 9, 10, 11, 12, 13, 15, 17, 19, 20, 21, 23, 24, 25, 26, 27, 28], "wherea": 22, "whether": [9, 17, 19, 26], "whi05": [0, 27], "which": [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 26, 28], "whiil": 2, "while": [2, 8, 11, 12, 15, 16, 19, 20, 21, 23, 25, 28], "whisper": [15, 19, 21], "whistl": 5, "whitman": [0, 17, 27], "whl": [], "who": [2, 4, 24, 28], "whole": [9, 19], "wht24": [], "why": [11, 15, 19, 24], "wichern": 0, "wide": [7, 8, 12, 16, 17, 18, 19, 21], "wider": [11, 16], "widmer": [], "width": [4, 11], "wikimut": [0, 5, 28], "wikipedia": 5, "wilei": [], "william": [0, 18], "wimbauer": [], "winter": [], "wise": [2, 11], "wish": [3, 8, 11], "within": [2, 4, 8, 9, 16, 26, 28], "without": [2, 6, 8, 13, 16, 18, 19, 23, 28], "wizadwongsa": [], "wjt": [], "wkgs24": [0, 28], "wmb": [0, 6], "wmd": [], "wnn": [0, 5, 9], "wojciech": 0, "wolf": [0, 17], "womanli": 4, "women": 4, "won": [0, 5, 9, 10, 18, 23, 28], "wong": [], "wook": [0, 15, 18, 28], "word": [0, 2, 11, 16, 17, 18, 19, 21, 22, 27, 28], "work": [2, 3, 8, 9, 11, 13, 15, 19, 21, 22, 24, 25, 28], "workflow": [2, 15], "workshop": [0, 8], "world": [4, 9, 15, 17, 19, 23, 26, 27, 28], "worst": 19, "worth": 8, "would": [9, 11, 23, 27, 28], "wrap": [2, 4, 10], "wrapper": 10, "wrapt": [], "wright": 0, "write": [2, 4, 16], "writeup": 11, "written": [], "wte": 4, "wu": [0, 5, 9, 18, 23, 28], "ww": [], "www": 0, "wxfx24": [], "wy23": [], "wzz": [0, 13], "x": [8, 9, 11, 17, 24], "x_": 28, "x_0": [], "x_1": [], "x_2": [], "x_m": [], "x_t": [], "x_transform": [], "xavier": [0, 9, 28], "xi": 0, "xia": 0, "xiang": [], "xiangyu": [], "xianzhao": [0, 8], "xiao": [], "xiaob": [0, 9], "xiaodong": [], "xiaogang": [], "xiaohuan": [], "xiaoliang": [], "xiaoyu": [], "xie": [0, 5, 9, 28], "xin": [], "xinchao": [], "xing": [], "xingjian": [], "xinhao": 0, "xintao": [], "xinyin": [], "xiufeng": [], "xu": [0, 18], "xubo": 0, "xuchan": [], "xuchen": [0, 9], "xudong": [], "xue": 
[], "xuefeng": [], "xuezhi": [0, 18], "xun": [], "xyzservic": [], "xzy": [], "y": [0, 8, 9, 17, 18, 24], "y_": [8, 9], "y_1": [8, 9], "y_2": [8, 9], "y_t": [8, 9], "yael": [], "yan": [], "yanbo": [], "yang": [0, 6, 17, 18], "yaniv": [0, 17], "yanqi": [0, 18], "yanzuo": [], "yao": [], "yaofang": [], "yariv": [], "yarl": [], "yaron": [], "yatong": [], "yazh": [0, 28], "ycontributors21": [], "ycy17": [0, 13], "ye": 27, "year": [7, 8, 9, 11, 22], "yee": [], "yellow": 13, "yen": [], "yeong": [], "yeongmin": [], "yesil": 0, "yet": [4, 8, 17], "yeung": 0, "ygp": [], "ygz": [], "yi": [0, 17, 18], "yichun": [], "yijin": [], "yilun": [], "yin": 0, "yinfei": [], "ying": [0, 5, 8], "yinghai": [], "yinghao": [0, 5, 8, 9], "yinhan": [0, 24, 28], "yinhuai": [], "yiqin": [], "yixiao": [0, 5, 23], "yiyi": 0, "yk": [], "yoav": [0, 8, 9], "yogesh": [], "yong": [], "yonghui": 0, "yoo": [], "yoon": [], "york": 15, "yoshua": [0, 17], "yossi": [0, 18], "you": [0, 3, 4, 5, 9, 10, 11, 15, 17, 19, 21, 24], "youngjung": [], "youngmoo": [0, 9], "your": [0, 3, 4, 6, 10, 24], "your_hf_token": 10, "yourself": [10, 15], "youtub": [], "youtube8m": 5, "yt": 5, "yt8m": [], "yu": [0, 8, 17, 18], "yuan": [0, 28], "yuancheng": [], "yuanzhen": [], "yuanzhong": [], "yuchen": [0, 28], "yudong": [0, 5, 8, 9], "yue": [0, 8, 9, 18, 28], "yueh": [], "yufeng": [], "yuhao": [], "yuhta": 0, "yujiu": [], "yukara": 0, "yuki": [0, 6], "yukio": [], "yuliang": [], "yulun": [], "yume": [], "yun": [0, 9], "yunfei": [], "yunfeng": [], "yunjei": [], "yunji": [], "yunxuan": [0, 18], "yupe": 0, "yuqe": [], "yusong": [0, 5, 18, 23, 28], "yutong": 0, "yuval": [], "yuxi": [], "yuxuan": 0, "ywv": [0, 16], "ywz": [], "yxk": [], "z": 11, "z1": 24, "z2": 24, "z_audio": 24, "z_text": 24, "zach": 0, "zachari": [0, 3, 5, 9, 15, 18], "zack": 0, "zackeri": 19, "zada": [], "zal": [0, 5, 18], "zaremba": 0, "zcc": [], "zcdb24": [0, 11], "zeghidour": [], "zehua": 0, "zejun": [0, 8], "zen": [0, 17], "zeqian": [], "zero": [0, 11, 18, 28], 
"zero_grad": [4, 24], "zettlemoy": [], "zeyu": [], "zhang": [0, 5, 8, 9, 17, 18, 23, 25, 28], "zhang_bertscore_2020": [], "zhao": [0, 6, 18], "zhaoyang": [], "zhen": [], "zheng": [], "zhengdong": [], "zhenhui": [], "zhf": [], "zhi": [0, 6], "zhide": [], "zhifeng": [], "zhihong": [], "zhije": [], "zhiji": [], "zhiqi": [], "zhishuai": [], "zhiyao": 0, "zhizheng": [], "zhong": [0, 6], "zhongyi": [], "zhou": [0, 18, 28], "zhouhang": [0, 5, 9], "zhouyu": [0, 17], "zhu": 0, "zhuang": [], "zhuoyuan": [0, 6], "zihao": [0, 5, 8, 9], "zijian": [], "zip": [], "ziqi": [], "zirui": 0, "ziv": 0, "ziwei": [], "zix": [0, 2], "zixun": [0, 5], "ziyu": 0, "zizheng": [], "zlo": [], "zongyu": [], "zoph": [0, 18], "zornitsa": [0, 8, 9], "zou": [], "zra23": [], "zuchao": [0, 6], "zukowski": 0, "zuluaga": 0, "zwcd23": [0, 11], "zzm": [0, 6], "\u00e0": 0, "\u00e1": [0, 5, 18], "\u00e4": 0, "\u00e4\u00e4": [], "\u00e7": 0, "\u00e9": [0, 18], "\u00eb": 0, "\u00ed": 0, "\u00f6": [0, 8, 9, 18, 28], "\u00fc": [], "\u0103": [], "\u02c6": []}, "titles": ["Bibliography", "Beyond Audio Modality", "Beyond Text-Based Interactions", "Conclusion", "Code Practice", "Datasets", "Evaluation", "Introduction", "Models", "Tasks", "Code Tutorial", "Diffusion Model-based Text-to-Music Generation", "Evaluation", "Introduction", "MusicGEN", "Connecting Music Audio and Natural Language", "Why Natural Langauge?", "Background", "Overview of Tutorial", "Advances", "Challenges", "The Framework", "Introduction", "Challenges", "Code Practice", "Conversational Retrieval", "Evaluation", "Introduction", "Models"], "titleterms": {"": [4, 24], "1": [4, 16, 24], "2": [4, 16, 24], "3": [4, 16, 24], "4": [4, 24], "A": [], "And": 7, "In": 19, "The": [7, 21], "about": 15, "abstract": 2, "adapt": [8, 21], "address": [], "advanc": 19, "aim": 15, "align": 19, "almost": 16, "anchor": 12, "annot": 17, "answer": 9, "appli": 28, "ar": [8, 22], "architectur": [8, 11, 24, 28], "attent": 21, "attribut": 28, "audio": [1, 10, 12, 14, 15, 
28], "audio2audio": 2, "augment": [19, 28], "author": 15, "automat": 6, "autoregress": [8, 21], "ax": 7, "background": 17, "base": [2, 6, 11], "benefit": [25, 28], "beyond": [1, 2, 28], "bibliographi": 0, "brief": [], "build": [4, 24], "call": 19, "caption": [9, 23], "chain": 19, "challeng": [20, 23, 25], "channel": 21, "class": [4, 24], "classif": 9, "code": [4, 10, 24], "codec": 14, "complex": [], "concaten": 21, "conclus": [3, 4, 24], "condit": [8, 11, 21], "connect": 15, "context": 19, "continu": 11, "control": 2, "convers": [9, 25], "creat": [4, 24], "cross": 21, "data": [4, 24, 28], "databas": [], "dataset": [4, 5, 24], "decod": [8, 19, 21], "definit": 13, "denot": [], "describ": [], "descript": [7, 8, 9, 18], "dialogu": [], "diffus": 11, "direct": 25, "distanc": 12, "distil": 19, "distribut": 23, "divers": [12, 28], "do": 7, "don": [], "earli": [17, 27], "effici": 20, "embed": 28, "emploi": 28, "encod": [8, 16, 19, 21], "engin": 24, "environ": [4, 24], "evalu": [6, 12, 26], "exampl": [], "fad": 12, "fall": [], "feedback": 19, "fid": 12, "framework": 21, "friendli": 16, "from": 19, "fr\u00e9chet": 12, "function": [19, 28], "further": 24, "fusion": 8, "futur": 25, "gener": [11, 17, 18, 19], "get": [4, 15, 24], "handl": 28, "hidden": 12, "histori": 13, "human": [5, 16, 19], "i": [7, 16, 28], "implement": 21, "incept": 12, "infer": 24, "initi": 28, "input": 19, "instruct": 8, "interact": 2, "interfac": 16, "introduct": [4, 7, 13, 22, 24, 27], "iter": 11, "joint": 28, "k": 21, "kei": 25, "label": 16, "langaug": [16, 18], "languag": [15, 19, 21, 22], "law": 19, "learn": [16, 19, 24, 28], "let": [4, 24], "leverag": 28, "limit": [12, 19, 23, 25], "listen": 12, "ll": 24, "llm": 8, "load": 4, "loss": 28, "make": 24, "mask": 21, "match": 6, "mc": [], "mean": 12, "method": 27, "metric": [6, 28], "mismatch": 23, "mo": 12, "modal": [1, 28], "model": [4, 8, 11, 18, 19, 21, 22, 24, 28], "modul": [8, 21], "motiv": 15, "mqa": [], "mtc": [], "multi": 28, "multimod": [8, 19], 
"multipl": 12, "mushra": 12, "music": [2, 7, 8, 9, 11, 15, 17, 18], "musiccap": [], "musicgen": 14, "musictextclip": [], "nativ": 8, "natur": [15, 16], "need": 7, "neg": 28, "neural": 14, "normal": 21, "open": 10, "opinion": 12, "other": 6, "our": [4, 24], "out": 27, "output": 19, "overview": [7, 18], "paradigm": [], "part": [], "perform": 20, "practic": [4, 24], "pre": 28, "precis": 26, "prefix": 21, "prerequisit": [4, 24], "problem": [13, 27], "qualiti": 12, "queri": [23, 28], "question": 9, "rag": 19, "reason": 19, "recal": 26, "refer": [5, 6, 8, 9, 12, 17, 18, 23, 25, 27, 28], "refin": 11, "relev": 12, "represent": [11, 16, 21], "resourc": [4, 24], "result": 4, "retriev": [17, 18, 19, 23, 24, 25, 27], "safeti": 20, "sampl": 28, "scalabl": 16, "scale": 19, "score": 12, "sdd": [], "section": 7, "semntica": 28, "sentenc": 28, "sequenc": 21, "set": [4, 24], "shot": 19, "similar": 28, "singl": [23, 25], "song": [], "sourc": 28, "stabl": 10, "stableaudio": [], "stage": [17, 27], "start": [4, 15, 24], "static": [], "step": [4, 24], "still": [], "stimuli": 12, "strateg": 28, "supervis": 16, "synthet": 5, "system": 25, "t": [], "tag": 28, "tak": [], "task": [9, 16, 19], "technic": 25, "techniqu": 28, "test": 12, "text": [2, 5, 11, 12, 21, 28], "thi": 7, "thought": 19, "through": 11, "tip": 28, "token": 21, "tool": 19, "toward": 28, "train": [4, 8, 24, 28], "transfer": 19, "transform": [], "trust": 20, "tune": 8, "turn": [23, 25], "tutoir": [], "tutori": [7, 10, 18], "type": [6, 9], "umbrella": [], "under": [], "understand": 24, "univers": 16, "up": [4, 24], "us": 19, "vocabulari": 27, "we": [4, 7, 24], "weak": 16, "what": [4, 7, 22, 24, 28], "why": [7, 16], "written": 5, "y": 16, "youtube8m": [], "yt8m": [], "z": 16, "zero": 19}}) \ No newline at end of file +Search.setIndex({"alltitles": {"1. Natural Langauge is (almost) universal label (y), task (z) encoder.": [[16, "natural-langauge-is-almost-universal-label-y-task-z-encoder"]], "2. 
Natural Langauge is (weak but scalable) supervision for representation learning": [[16, "natural-langauge-is-weak-but-scalable-supervision-for-representation-learning"]], "3. Natural Langauge is Human Friendly interface.": [[16, "natural-langauge-is-human-friendly-interface"]], "About the Authors": [[15, "about-the-authors"]], "Abstract Musical Controls": [[2, "abstract-musical-controls"]], "Adapted LLMs": [[8, "adapted-llms"]], "Adaptive Modulation/Normalization": [[21, "adaptive-modulation-normalization"]], "Advances": [[19, null]], "Aligning Language Models with Human Feedback": [[19, "aligning-language-models-with-human-feedback"]], "Apply Text Augmentation Techniques": [[28, "apply-text-augmentation-techniques"]], "Architecture": [[11, "architecture"]], "Architectures": [[8, "architectures"]], "Audio Diversity and Quality": [[12, "audio-diversity-and-quality"]], "Audio-Sentence Joint Embedding": [[28, "audio-sentence-joint-embedding"]], "Audio-Tag Joint Embedding": [[28, "audio-tag-joint-embedding"]], "Audio2Audio Controls": [[2, "audio2audio-controls"]], "Autoregressive Language Models": [[21, "autoregressive-language-models"]], "Background": [[17, null]], "Benchmarks": [[6, "benchmarks"]], "Beyond Audio Modality": [[1, null]], "Beyond Text-Based Interactions": [[2, null]], "Beyond semntica attributes, toward handle similarity queries": [[28, "beyond-semntica-attributes-toward-handle-similarity-queries"]], "Bibliography": [[0, null]], "Chain-of-Thought Reasoning of Language Models": [[19, "chain-of-thought-reasoning-of-language-models"]], "Challenges": [[20, null], [23, null]], "Channel Concatenation": [[21, "channel-concatenation"]], "Code Practice": [[4, null], [24, null]], "Code Tutorial": [[10, null]], "Conclusion": [[3, null]], "Conclusion \ud83c\udf89": [[4, "conclusion"], [24, "conclusion"]], "Conditioning": [[11, "conditioning"], [21, "conditioning"]], "Conditioning and Fusion": [[8, "conditioning-and-fusion"]], "Connecting Music Audio and Natural 
Language": [[15, null]], "Conversational Music Description": [[9, "conversational-music-description"]], "Conversational Retrieval": [[25, null]], "Datasets": [[5, null]], "Diffusion Model-based Text-to-Music Generation": [[11, null]], "Diffusion: Continuous Generation through Iterative Refinement": [[11, "diffusion-continuous-generation-through-iterative-refinement"]], "Distillation of Language Models": [[19, "distillation-of-language-models"]], "Early Stage Retrieval Methods": [[27, "early-stage-retrieval-methods"]], "Early Stage of Music Annotation and Retrieval": [[17, "early-stage-of-music-annotation-and-retrieval"]], "Early Stage of Music Generation": [[17, "early-stage-of-music-generation"]], "Employ Strategic Negative Sampling": [[28, "employ-strategic-negative-sampling"]], "Encoder-Decoder Attention (a.k.a. Cross Attention)": [[21, "encoder-decoder-attention-a-k-a-cross-attention"]], "Encoder-Decoder Models": [[8, "encoder-decoder-models"]], "Evaluation": [[6, null], [12, null], [26, null]], "Fr\u00e9chet Inception Distance (FID/FAD)": [[12, "frechet-inception-distance-fid-fad"]], "Future Directions": [[25, "future-directions"]], "Getting Started": [[15, "getting-started"]], "History": [[13, "history"]], "Human-written text": [[5, "human-written-text"]], "Implementing Language Models": [[21, "implementing-language-models"]], "Inception Score": [[12, "inception-score"]], "Inference & Make Retrieval Engine": [[24, "inference-make-retrieval-engine"]], "Initialize with Pre-trained Models": [[28, "initialize-with-pre-trained-models"]], "Introduction": [[4, "introduction"], [7, null], [13, null], [22, null], [24, "introduction"], [27, null]], "Key Benefits of Conversational Retrieval": [[25, "key-benefits-of-conversational-retrieval"]], "Key Technical Challenges": [[25, "key-technical-challenges"]], "Langauge Models": [[18, "langauge-models"]], "Language Models as a Framework": [[21, "language-models-as-a-framework"]], "Let\u2019s Get Started! 
\ud83d\ude80": [[4, "let-s-get-started"], [24, "let-s-get-started"]], "Leverage Diverse Training Data Sources": [[28, "leverage-diverse-training-data-sources"]], "Limitation": [[12, "limitation"]], "Limitations": [[6, "limitations"], [19, "limitations"]], "Limitations of Single-Turn Systems": [[25, "limitations-of-single-turn-systems"]], "Listening Test": [[12, "listening-test"]], "MOS Test (Mean Opinion Score)": [[12, "mos-test-mean-opinion-score"]], "MUSHRA Test (Multiple Stimuli with Hidden Reference and Anchor)": [[12, "mushra-test-multiple-stimuli-with-hidden-reference-and-anchor"]], "Masked Language Models": [[21, "masked-language-models"]], "Match-based metrics": [[6, "match-based-metrics"]], "Metric Learning Loss Functions": [[28, "metric-learning-loss-functions"]], "Models": [[8, null], [28, null], [28, "id3"]], "Motivation & Aims": [[15, "motivation-aims"]], "Multi-modal Joint Embedding Model Architecture": [[28, "multi-modal-joint-embedding-model-architecture"]], "Multimodal AR Models": [[8, "multimodal-ar-models"]], "Multimodal Decoders for Language Model Outputs": [[19, "multimodal-decoders-for-language-model-outputs"]], "Multimodal Encoders for Language Model Inputs": [[19, "multimodal-encoders-for-language-model-inputs"]], "Music Captioning": [[9, "music-captioning"]], "Music Classification": [[9, "music-classification"]], "Music Description": [[18, "music-description"]], "Music Generation": [[18, "music-generation"]], "Music Question Answering": [[9, "music-question-answering"]], "Music Retrieval": [[18, "music-retrieval"]], "Music description datasets.": [[5, "description-datasets"]], "Music description models.": [[8, "description-models-table"]], "MusicGEN": [[14, null], [14, "id4"]], "Natively Multimodal AR Models": [[8, "natively-multimodal-ar-models"]], "Neural Audio Codec": [[14, "neural-audio-codec"]], "Overview of Tutorial": [[18, null]], "Overview of this tutorial section": [[7, "overview-of-this-tutorial-section"]], "Performance & 
Efficiency": [[20, "performance-efficiency"]], "Precision and Recall": [[26, "precision-and-recall"]], "Prefix Conditioning": [[21, "prefix-conditioning"]], "Prerequisites": [[4, "prerequisites"], [24, "prerequisites"]], "Problem Definition": [[13, "problem-definition"]], "Problem: Out of Vocabulary": [[27, "problem-out-of-vocabulary"]], "Query-Caption Distribution Mismatch": [[23, "query-caption-distribution-mismatch"]], "References": [[5, "references"], [6, "references"], [8, "references"], [9, "references"], [17, "references"], [18, "references"], [23, "references"], [25, "references"], [27, "references"], [28, "references"]], "Representation": [[11, "representation"]], "Representation: Text as Sequence of Tokens": [[21, "representation-text-as-sequence-of-tokens"]], "Resources for Further Learning \ud83d\udcda": [[24, "resources-for-further-learning"]], "Resources \ud83d\udcda": [[4, "resources"]], "Results \ud83d\udcc8": [[4, "results"]], "Retrieval-Augmented Generation (RAG)": [[19, "retrieval-augmented-generation-rag"]], "Scaling Laws of Language Models": [[19, "scaling-laws-of-language-models"]], "Single-Turn Retrieval Limitations": [[23, "single-turn-retrieval-limitations"]], "Stable Audio Open Tutorial": [[10, "stable-audio-open-tutorial"]], "Step 1: Setting Up Our Environment": [[24, "step-1-setting-up-our-environment"]], "Step 1: Setting up our environment": [[4, "step-1-setting-up-our-environment"]], "Step 2: Loading the data \ud83d\udcca": [[4, "step-2-loading-the-data"]], "Step 2: Understanding the Data \ud83d\udcca": [[24, "step-2-understanding-the-data"]], "Step 3: Creating Our Dataset Class \ud83c\udfa8": [[24, "step-3-creating-our-dataset-class"]], "Step 3: Creating our dataset class \ud83c\udfa8": [[4, "step-3-creating-our-dataset-class"]], "Step 4: Building & Training Our Model Architecture \ud83c\udfd7\ufe0f": [[24, "step-4-building-training-our-model-architecture"]], "Step 4: Building and training our model \ud83c\udfd7\ufe0f": [[4, 
"step-4-building-and-training-our-model"]], "Synthetic Text": [[5, "synthetic-text"]], "Tasks": [[9, null]], "Text Relevance": [[12, "text-relevance"]], "The Framework": [[21, null]], "The axes of music description": [[7, "the-axes-of-music-description"]], "Tips for Training Audio-Text Joint Embedding Models": [[28, "tips-for-training-audio-text-joint-embedding-models"]], "Tool Use and Function Calling": [[19, "tool-use-and-function-calling"]], "Transfer Learning from Language Models": [[19, "transfer-learning-from-language-models"]], "Trust & Safety": [[20, "trust-safety"]], "Types of music captioning": [[9, "types-of-music-captioning"]], "What We\u2019ll Build": [[24, "what-we-ll-build"]], "What are language models?": [[22, "what-are-language-models"]], "What is music description? And why do we need it?": [[7, "what-is-music-description-and-why-do-we-need-it"]], "What is the Benefit of Joint Embedding?": [[28, "what-is-the-benefit-of-joint-embedding"]], "What we will build": [[4, "what-we-will-build"]], "Why Natural Langauge?": [[16, null]], "Zero-shot Task Transfer and In-Context Learning": [[19, "zero-shot-task-transfer-and-in-context-learning"]]}, "docnames": ["bibliography", "conclusion/beyondaudio", "conclusion/beyondtext", "conclusion/intro", "description/code", "description/datasets", "description/evaluation", "description/intro", "description/models", "description/tasks", "generation/code", "generation/diffusionmodel", "generation/evaluation", "generation/intro", "generation/lmmodel", "intro", "introduction/advantange", "introduction/background", "introduction/overview", "lm/advances", "lm/challenges", "lm/framework", "lm/intro", "retrieval/challenge", "retrieval/code", "retrieval/conversational_retrieval", "retrieval/evaluate", "retrieval/intro", "retrieval/models"], "envversion": {"sphinx": 62, "sphinx.domains.c": 3, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 9, "sphinx.domains.index": 1, 
"sphinx.domains.javascript": 3, "sphinx.domains.math": 2, "sphinx.domains.python": 4, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx.ext.intersphinx": 1, "sphinxcontrib.bibtex": 9}, "filenames": ["bibliography.md", "conclusion/beyondaudio.md", "conclusion/beyondtext.md", "conclusion/intro.md", "description/code.ipynb", "description/datasets.ipynb", "description/evaluation.md", "description/intro.md", "description/models.md", "description/tasks.md", "generation/code.ipynb", "generation/diffusionmodel.md", "generation/evaluation.md", "generation/intro.md", "generation/lmmodel.md", "intro.md", "introduction/advantange.ipynb", "introduction/background.md", "introduction/overview.md", "lm/advances.md", "lm/challenges.md", "lm/framework.md", "lm/intro.md", "retrieval/challenge.md", "retrieval/code.ipynb", "retrieval/conversational_retrieval.md", "retrieval/evaluate.md", "retrieval/intro.md", "retrieval/models.md"], "indexentries": {}, "objects": {}, "objnames": {}, "objtypes": {}, "terms": {"": [0, 2, 6, 7, 8, 9, 10, 11, 15, 16, 17, 18, 19, 20, 21, 22, 23, 27, 28], "0": [4, 5, 9, 10, 11, 12, 16, 24, 26, 28], "00": [], "000": 22, "000061": 10, "00006103515625": 10, "00341": [0, 18], "0050b2820a1e709ffa623f9a9e8ae42d0903535f2150613cbfeb7f16932a": [], "00512": [], "0083": 24, "00830": [], "0092": 4, "00927": [], "01": [0, 9], "01095": [], "01103": 0, "01324": [], "01337": [0, 6, 8], "01420": [0, 5], "01546": [], "01618": [], "01626": [], "01652": [0, 18], "01733": [], "01840": [], "019": [], "01917": 0, "02": [0, 4, 8, 24], "021c1d407befb505791764ad2cbd56ceaaa53a746baed01d2e2143f05f18": [], "02252": 0, "02257": 0, "02696": [], "03": [0, 9], "03458": [], "03499": [0, 17], "03739": [], "03748": [0, 28], "03917": [], "04": [0, 5, 8, 9], "04208": [], "04378": [], "04628": [], "04658": [], "04805": [0, 18, 28], "04868": [0, 8], "05": [], "05011": 0, "05224": [], "056d58b606731f94fe395266c604ea9efcecc10e6857ceb9b10e6831d746": [], "0577": 4, "0583": 4, "0586": 4, 
"0595336914c5619e5f28a1fb793285925a8cd4b432c9da0a987836c7f822": [], "05967": [], "06": [0, 9], "06125": [], "06174": [], "06178": 0, "0686": [], "07": [0, 5, 8, 9, 24], "0702": 4, "07069": [0, 18], "07160": [0, 8, 9, 18], "07439": [0, 23, 25], "07724": [], "07837": [0, 17], "07848": [], "0791": 4, "07919": [0, 8], "08": [0, 6, 8], "08070": [], "08384": 0, "08466": [], "08667": 0, "08691": [], "08774": [0, 18], "08803": [], "09": [0, 6], "0933": 4, "09636": [], "0984": 4, "0a": [], "0a0": [], "0a1": [], "0b": [], "0da8e798b168": 4, "0dfc83e0fe455cfe6272b23a65039b4101c63a4e7446801e26178b675fbf": [], "0ea5e3611e0b63766a56f81e7bc5cfa05c52e3a3f0b8d66b25c7262aeda": [], "0m": [], "1": [0, 2, 5, 6, 8, 9, 10, 11, 12, 15, 18, 19, 26, 28], "10": [0, 4, 5, 6, 8, 9, 10, 15, 24, 28], "100": [4, 12, 24], "1000": 10, "10057": [0, 5, 23], "10191775": [0, 9], "1024": [4, 24], "1025": [0, 23], "10301": [], "1032d0dbc2152c45f3d1e582a72e68f41898de9665202392d9400dfa329d": [], "1038": [], "10447027": [0, 5, 8], "1045": [0, 23], "104e9f575c27679ffedf994e53e6ac39067a0e77b2ea0d1567d4738686": [], "106": [], "1068": 0, "1076": [0, 9], "1077": 0, "10789": [], "10790": [0, 8], "10828fb40dcf097d1af84c1f2f863bae4046d5949450bf95b3260f767672": [], "109": [0, 6], "10970": 0, "10f97f73544edcdef54409f1d839f6049a0d79df68adbc1ceb24d1aaca42": [], "11": [0, 6], "1109": [0, 5, 8, 9], "1116": 0, "1120": 0, "11255": [0, 5, 8], "11305": 0, "11315": 0, "11325": [0, 5, 18], "113k": 5, "114": [], "11401": [0, 8, 9], "11415": [0, 8, 9], "1141a8232723dcb10a595cc0ce4321dcbbd5215300bf4acfc142343205bf": [], "1146": 24, "11489": [0, 25], "11498": [0, 28], "114m": [], "115": [], "1165": 4, "11692": [0, 6, 28], "11757": [], "1180": 0, "11834": [0, 8], "1186": [], "1188": 0, "11994": [], "11k": 5, "12": [0, 16], "12015": [], "1208": [0, 9], "120bpm": 16, "121": [], "1212": [0, 9], "12179": [], "12207897848a653d03ebbf6775a29d949408ded5f99b2d87198bc5c93508": [], "12208": [0, 18, 28], "12415": [0, 18, 28], "125": 0, 
"125817600": 4, "12661": [], "12662": 0, "1267": 4, "128": [4, 24], "12839": [], "13": [0, 16], "130bpm": 16, "13218": [], "133": [0, 8], "13301": [], "13438": 0, "13569": [0, 28], "1362": 0, "13686": [], "1371": 0, "13731": [], "14": [4, 15], "140": [0, 10, 18], "1412": [], "14167": [], "142": [0, 8], "1426": 4, "14358": 0, "1446": 4, "14784": [0, 5], "14793": [0, 5], "1481": 4, "14867": [], "149": [], "14rn7hpkvk": [0, 8], "15": 28, "150": 16, "15018": [], "150k": 8, "1514580907b0bac0970415e5e24ef96a9c1fa71dcf2aa0139045b58fae9a": [], "1534": 0, "15573": [0, 6, 8], "156": [], "15885": [], "16": [0, 8, 12, 13, 16, 17, 18, 27], "1601": [4, 24], "16020": [0, 6], "1604": [], "1608": [0, 8], "1609": [0, 17], "1612": [0, 17], "162": 0, "163": [], "16322": [], "16372": [], "16501": [0, 8, 9], "16512": [0, 8, 9], "1679": 24, "16798": [0, 9], "17": [0, 12, 13, 14, 17], "17042": [], "17162": [], "173": 0, "179": [], "179dd1bf8fd6bd689f0907f4baed557d2b12d2cf3d7ed1a8ecefe0a63d83": [], "17a": 0, "17b": [0, 13], "17th": [0, 8], "18": [0, 13, 17, 18], "1802": [], "1805": [], "1807": [0, 28], "1810": [0, 18, 28], "1812": [], "18407": [], "18503": [], "18653": [0, 6, 8, 9], "1869": 4, "1874": 4, "18754": 24, "18828": [], "18th": 0, "19": [0, 13, 18], "1907": [0, 28], "19159": [], "1937": [], "194": 0, "1950": [], "19512": [], "1964": [], "1970": 13, "1975": [], "1979": [0, 6], "1982": 0, "1983": 0, "1989": 0, "1990": 13, "1992": [], "1998": [0, 6], "19d5ff584cb58f654d22d8d6552d7c2fff7b85e4a9d525357f62a4d1e7e0": [], "1a": [], "1b69b697fe067d51219cfd64d0712bcbbce3b187389cb0793d9844ec14b1": [], "1bdb57a072903b222b1a745aa634cb845ff5f52a88ddd5ed1640ecf30beb": [], "1c": [], "1d": [11, 14], "1e": [4, 24], "1f": [], "1f0a22a6bcdd3fc26c73f63a025d05bd565901b729d56bcb093c722a6c4c": [], "1k": [], "1m": 11, "2": [0, 2, 3, 5, 6, 8, 10, 11, 15, 17, 18, 19, 27], "20": [0, 8, 12, 13, 15, 18], "200": 27, "2000": 17, "2001": 0, "2002": [0, 9], "2003": [0, 9], "2005": [0, 17, 18, 27], "2007": [0, 
17], "2008": [0, 17, 18, 27], "2009": [], "2010": [0, 9, 17, 23], "2012": [], "2013": [], "2014": [], "2015": 13, "2016": [0, 8, 17], "2017": [0, 9, 17], "2018": [0, 13, 17, 18, 28], "2019": [0, 17, 18, 28], "202": 0, "2020": [0, 13, 18], "2021": [0, 8, 9, 11, 15, 18, 28], "2022": [0, 8, 9, 18, 28], "2023": [0, 5, 8, 9, 18, 23, 25, 28], "2024": [0, 5, 6, 8, 9, 15, 18, 23, 25, 28], "20445": [0, 5, 8, 9], "207": [], "20a": 0, "20b": [0, 13], "20xx": [], "21": [0, 6, 8, 9, 11, 16, 18, 28], "2104": [], "2109": [0, 18], "2110": [], "2111": 0, "214": [], "21450": 0, "21474": 0, "2161": [0, 9], "21783": [], "21th": 0, "22": [0, 8, 13, 16, 18, 28], "2202": [], "2204": [], "2205": 0, "22050": [4, 24], "2206": [], "2208": [0, 18, 28], "2210": 0, "2211": [], "2226": 0, "2231": 4, "2234": 0, "22a": [], "22b": [], "22k": 5, "23": [0, 2, 5, 8, 9, 11, 12, 13, 18, 23, 25, 28], "2301": [0, 5, 18, 25], "2302": 0, "2303": [0, 18], "2304": [], "2305": [0, 8], "2307": [], "2308": [], "231": [0, 5, 8, 9], "2310": [0, 8, 9, 18], "2311": [0, 5, 8, 18, 23], "2312": [], "2344469e2084fb287c2e0b57b72910309874c3245463acd6cf5e3db69324": [], "2350": 0, "2354": 0, "2358": 4, "237m": [], "238": 0, "2392": [0, 9], "2396": [0, 9], "23a": [0, 11], "23b": [0, 13], "23ef2fd02913d65d43dc7516fc829af709314a66c6f0bdc2e361fdcecc2d": [], "24": [0, 2, 5, 6, 8, 9, 11, 13, 14, 18, 23, 25], "2401": [], "2402": 0, "2403": [], "2404": [0, 28], "2405": [], "2406": [0, 6], "2407": [0, 5, 8, 9], "2408": [0, 6, 8], "2409": [0, 28], "2410": [0, 6, 8], "2411": [0, 6, 23, 25], "249": [], "24963": [0, 8], "24a": 0, "24b": 0, "24th": 0, "25": [0, 18], "25bcf75e373412daf1fd88045ab3aa8140a0d804ef0e70712c4f2c5b94d8": [], "25h": [], "25hcollect": [], "25hdownload": [], "25hrequir": [], "25l": [], "25th": [0, 6, 8, 15], "26": [0, 8], "26045404a30c8a200e960fb54fbaf4b73d12e58cd28e03b306b084253f4f": [], "262145": 24, "263": [], "264k": 5, "265": 10, "266": [], "27": [], "2713830": [0, 9], "273186269": 0, "2754": [], "2764": [], 
"2788": 24, "28": 0, "28492": [], "28518": [], "286": [0, 5, 8], "287": [], "2880": 0, "2894": 0, "28k": 5, "28th": 0, "29": 5, "290": [0, 5, 8], "2919": 24, "293": [0, 9], "2971": 24, "2a": [], "2a3e3df732393fed8b3ebf2ec078f05546de641fe1b667ee316ec1dcf3b7": [], "2b": [], "2c": [], "2d": 11, "2d1c0ebfd092e25935b86509a9a817159212d82aa43d7fb07eca4eeff2c2": [], "2d231b35456506b7c98b3ab9bbf07917b205fed8615d2e59e976ab497fff": [], "2d512efdb0de203d1f0312fae53433c3009ba70b0078421d25baaedc960a": [], "2e": [], "2eb3cd785efd67806c46c13a17339708ddc346cbb684eade7a6e6f79536a": [], "2f": [], "2k": 5, "2m": 5, "2min": 5, "2ugen": [], "3": [0, 3, 5, 6, 8, 9, 10, 15, 17, 18, 19], "30": [5, 9, 10, 11, 22], "300": [4, 24], "302": [0, 9], "30aa32745af16af0a9a650115fbe81bde7c610ed5c21b381fca0196f3a7f": [], "31": [], "3122": 4, "313": 0, "3169": [], "317": [], "31884": [4, 24], "319": [], "31k": 8, "31m": [], "31m1": [], "31m10": [], "31m108": [], "31m11": [], "31m12": [], "31m122": [], "31m13": [], "31m14": [], "31m141": [], "31m15": [], "31m16": [], "31m17": [], "31m172": [], "31m191": [], "31m2": [], "31m3": [], "31m4": [], "31m470": [], "31m493": [], "31m5": [], "31m6": [], "31m7": [], "31m742": [], "31m768": [], "31m796": [], "31m8": [], "31m834": [], "31m836": [], "31m837": [], "31m839": [], "31m845": [], "31m848": [], "31m849": [], "31m85": [], "31m853": [], "31m855": [], "31m860": [], "31m861": [], "31m868": [], "31m872": [], "31m874": [], "31m878": [], "31m884": [], "31m890": [], "31m897": [], "31m9": [], "31m900": [], "31m904": [], "31m913": [], "31m918": [], "31m920": [], "31m921": [], "31m925": [], "31m937": [], "31m942": [], "31m947": [], "31m949": [], "31m95": [], "31m973": [], "31m978": [], "31m982": [], "31m995": [], "31merror": [], "32": [0, 4, 9, 11], "324": 0, "326": 0, "32767": 10, "32m0": [], "32m1": [], "32m10": [], "32m106": [], "32m11": [], "32m112": [], "32m12": [], "32m121": [], "32m122": [], "32m13": [], "32m14": [], "32m143": [], "32m149": [], "32m15": [], 
"32m16": [], "32m162": [], "32m163": [], "32m17": [], "32m174": [], "32m179": [], "32m18": [], "32m19": [], "32m2": [], "32m20": [], "32m207": [], "32m21": [], "32m214": [], "32m22": [], "32m23": [], "32m24": [], "32m25": [], "32m26": [], "32m266": [], "32m27": [], "32m28": [], "32m287": [], "32m29": [], "32m3": [], "32m30": [], "32m31": [], "32m317": [], "32m319": [], "32m32": [], "32m33": [], "32m333": [], "32m34": [], "32m35": [], "32m36": [], "32m368": [], "32m37": [], "32m38": [], "32m389": [], "32m39": [], "32m392": [], "32m399": [], "32m4": [], "32m40": [], "32m41": [], "32m42": [], "32m43": [], "32m434": [], "32m44": [], "32m45": [], "32m46": [], "32m47": [], "32m48": [], "32m481": [], "32m49": [], "32m5": [], "32m50": [], "32m51": [], "32m519": [], "32m52": [], "32m53": [], "32m54": [], "32m55": [], "32m56": [], "32m563": [], "32m59": [], "32m6": [], "32m60": [], "32m61": [], "32m614": [], "32m616": [], "32m63": [], "32m64": [], "32m7": [], "32m71": [], "32m727": [], "32m73": [], "32m76": [], "32m77": [], "32m774": [], "32m78": [], "32m8": [], "32m81": [], "32m87": [], "32m890": [], "32m899": [], "32m9": [], "32m90": [], "32m92": [], "32m94": [], "33": [], "331": 0, "333": [], "33437": 24, "33k": 5, "34": 0, "3479": 4, "34th": 0, "35": [], "3523": 24, "3572": 24, "35th": 0, "36": [], "360": [], "3643": [0, 5, 8, 9], "3655": [0, 5, 8, 9], "368": 0, "36m": [], "36m0": [], "37": [], "3727": 24, "375": 0, "38": [], "39": 4, "392": [], "39c7c0d87f8d4e6c020a393182060eaefeeae6c01dab6a84ec346f2567df": [], "3a": [], "3af39d34be01a24a6e65433d19e107099374224905f1e0cc6bbe1fd22a2f": [], "3b": [], "3b00ac340a1aab3389ebcc52c779914a44aadf7b0cb7a3bf053195735607": [], "3c": [], "3d": [], "3e": [], "3f": [], "3k": [4, 24], "3m": 10, "4": [0, 3, 5, 6, 7, 10, 15, 16, 18], "40": [], "41": 0, "42": [0, 16, 28], "43": [], "434": [], "435d5d7ec64d1c8b422ac9ebe42d2f3b2ac0b3f8a56f5c04dd0f3b7ba83c": [], "4361": 0, "4370": 0, "44": 11, "440": [], "4407": [0, 9], "44100": 10, "45": [4, 
24], "4524": 4, "454d6e7f0158951d8a78c2e1eb4f69ae81beb8dca5fee9809c6c99e9d0d0": [], "456": 0, "4583": [0, 28], "4587": [0, 28], "46": [], "460": 0, "46649": 24, "467": [0, 17, 18, 27], "46th": [0, 18, 25], "47": [], "476": [0, 17, 18, 27], "48": [], "48072": 24, "48550": [0, 6, 8], "4868": 24, "49": [], "4b": [], "4c": [], "4c4672025c23a305231a81bf492f65aa3ea0965a89f9ca369a9ee7d47fd9": [], "4d": [], "4e": [], "4f": [4, 24], "4f639c1168d7aada749a896afb4892a831e2041bebdcf636aebfe9e86556": [], "4o": 19, "5": [0, 3, 4, 5, 9, 10, 12, 16, 18, 23, 24, 25, 28], "50": [4, 10, 27], "500": 10, "5063": 24, "51": 5, "519": [], "52": [], "521": [0, 8], "5244": 24, "525": [], "53": [0, 18], "5302": 24, "531": [0, 17], "534": [0, 17], "53k": 5, "54": [], "540": [], "541": [], "55": [], "5593a40fcd0981bda85274bb3e622ac433a94ae1e11ef8639de362cfa7d": [], "55bpm": 16, "55cdeed5889f2076fdb125bc87bb7ab0f1715c84b0a4619c44833d890f60": [], "56": [0, 28], "564beb0c78bf83018a146dfcdc959c99c10a0d136480b932a350c852adbc": [], "566": [], "57": [], "5730cc60bf438b56438756e45ac469c01bcf9c47d87632c468623167b7f": [], "5781": 4, "58": [], "580600f441f6fc05218bd6c9d5794f4aef072a7d9093b291f1c50a9db8bc": [], "58b70a580de00893223d61de8fea167877a3aed97d4a5e1405c9159ef925": [], "58d71f2041bc89919f56a69f8f2b9535a55d513bb005fbe4f8ee5d367170": [], "59": [], "591": [0, 28], "595": [0, 28], "5a": [], "5a36494314e4780362b15a7e190095eec68366a0d512b5b532607c213a26": [], "5af6804c4cc0fed83f47bff6e413a98a36618e7d40185cd36e69737f3b0": [], "5b": [], "5c": [], "5d": [], "5e": [], "5f30aea01532961bab043775258b06484f2a57530a88940e4cc3aea4f1f1": [], "5k": 5, "5min": 5, "6": [4, 5, 16, 24, 28], "60": [], "607": [], "608": 10, "609961972f694cb9520c4c3d201e377a26583e1eb83bc5a334c893729214": [], "60cd92bd3ec00948800984410f4cf5ded5bd8e9b715729f3642efe0edb3d": [], "61": [0, 23], "616": [], "61b627404c2d6f31dcbc491ff83da1f4336c7ae7893cfdc6c52db490ec59": [], "621": [], "6262": 4, "62nd": [0, 6, 8], "63": [], 
"6307fbef88d9b5ee7421e68d78a9f162e0da4900bc5f5793f6d3d0e34fb8": [], "64": 11, "6402242dde160d9ef9903487b4277443dc3da04615f6c4d3b48564a8ab57": [], "65": [], "66": [], "661": [], "6626": 0, "6637": 0, "67": [0, 18], "671c0e1f2572ba625cbcc1faeba9435e00330c3d6962858711445cf1e817": [], "6724805521ab4e723a12182f92374031032aff28a8a89dc8505c52b79032": [], "6742ef9206409d5ce1fdf44d5ca1687cdc3847ba0485424e2c731e6bcf67": [], "67ebd9d6ce9e65747e720c4c5614cd3a137e61340aec274657fcd9cc5162": [], "68": [], "6809": 24, "681": [], "684": [], "69": [], "693": [], "6980": [], "6a": [], "6b": [], "6d": [], "6e": [], "6e30b6b0cc0c18f8eb566e4f440e8127d9dad32bcaa70d38c8c44a21e62d": [], "6e9f9b41c48750a45ad07cc6d43a2979bfc09e6989656aece97cc59cbef1": [], "6f": [], "7": [4, 5, 10, 16, 24, 25], "70": [0, 18], "7047": 24, "71": [], "72": [], "72a58cb3b241d869811be4f9328a37f1563dc9c48af8c0467cb681f9ed46": [], "73": [], "74": [], "75718504a1bf0562e7e02def34cfc9bb274b6f284773cbeeeba0767a31b": [], "75a9c9421471a6c4805dbf2356f7c181a29c1879239abab1ea2cc8f38b40": [], "76": [], "7616": 0, "7633": 0, "768": 24, "77": 0, "774": [], "7762": [0, 8], "7770": [0, 8], "77cc11c7a9ea9fd05503def69e3d18605852cd0d4b0d3b8f15bbeb3ef1d1": [], "77edf4c29c8d6728b49d3f0abb22159bb9c0c4ddebd721c09486b34985c8": [], "78": [], "784": [0, 8, 9], "78bd0e95dd2444b6caacbca2b730671d4295ccb628ef58b81bee903629df": [], "7907": 24, "7925": 24, "7952585": [0, 9], "7b": 5, "7b5a1a5419e400f715387a48f65225ec7a3f2104465f346fc75e8793407b": [], "7c": [], "7dcce24e978bc14a18e2a3f7e2d6f4d2001533dc0cffab143bb3f8ec13d6": [], "7e": [], "7f": [], "8": [0, 4, 5, 8, 9, 16, 18, 24, 28], "80": 0, "800560": [0, 9], "80370da514096c6190f8913668198380ea09c2d252cfa4e85a9c096d3b40": [], "804": 0, "807": 0, "80cc3315dd1ca706643b78f894901d4d888ffe376a5e401f73d9db61071": [], "81": [], "8146aad7d88f4fcb3a6218f41a60f6c2d4e3a72de72da1825dc7c8f7877c": [], "81d47999aebc1b155f81eca4477a616a70f238a2549848c38983f3c22a82": [], "828": [], "83": [], 
"83871f3c50fc983b88547c196d11cf8c3340e37c32d2e9d6152abe2c61f7": [], "84": 0, "8462": 24, "85": [], "85249acbac630f34cd113dca4b1a72f55d3ad4c26bc9305a27aef6049756": [], "859": [0, 8], "86": [0, 9], "8630": 4, "8653ae6d18e20183fc6051fd2e10cd0c46e16a6b71eb34edef8d465dc969": [], "86bb218c7926e1da7a52e0696cab120a17c995933f08d8228d9aa83b44c5": [], "87": [], "8748": [0, 28], "8763": [0, 28], "88": [], "8821": [], "8831": [], "88k": 5, "89": [], "890a583cd3f2be27ecf32b479d5d615710bb926d92da03e3f7838ff3e58b": [], "899": [], "8a": [], "8b5d82fe2d9c7f260fb73121418f5e07d4e38c329ea3886a5b0e55586113": [], "8c": [], "8c75caed8f2462d63c7fd65e16c832b8f76cda331ac9e615e914ee80bac9": [], "8d": [], "8da8dd078b354a89602a875d310a0d725dad92b5b4d61069576e0a0e02e4": [], "8dd4d6de0fbba9d8f10d7b655be0578d5bda6e4db425210c265b0ea6c804": [], "8df4efa78df8b129847c8a7c0e492376cca62ab68453e5a20375a1c6291b": [], "8df927d3f0951cf67ca5973d89b35bcbda1777a4c78bf90a853d02d91285": [], "8e": [], "8f": [], "8f0c4a5bb9fd491c277c21eff7ccae71b47d43c4446c9d0c6cff2fe8c2c4": [], "8f8e631fcdc2ff978609eaeef1d6994bf2f028b59d9ac67640ed051f1218": [], "8k": 5, "9": [4, 5, 24, 28], "90": 4, "9048": 24, "9090": 4, "91": [], "917": [23, 25], "92": [], "9240": 24, "927e3a8899e52a27fa57a48607ff7dc91a9ebe97399b357b85a0c7892e00": [], "93": [], "9315": 4, "937": [0, 9], "9375917786cb39270b0ee6634536c0e22abf225825602688990d8f5c6c19": [], "9377bcb415797e44274b51d46e3249eba641711cf3348050f76ee7b15ffc": [], "93f7309eb40a9299c59a6637f13c21b08e585c569fee85901ccd55ce00f5": [], "94": [], "943": [], "94797cfe0263a30805f3074e535adfde02b885ac43d1e4dac85f82213b0b": [], "94c7dab8cfe7d41a23133634576fb89412e3430f28ca8d44411a77c2f18d": [], "95": [], "952": [0, 9], "953": [], "96": 11, "96142937f66150805c25c4d0f31ee4132fd33497753400734f9dfdcbdc66": [], "9637": [0, 8], "9662": [0, 8], "9748": 4, "98": [], "99": [], "9963d588cc3d75d766c819e0377a168ef83cf3316a92769971527a1ad1d": [], "9a": [], 
"9a683359ad2ed11b2303a7a94800db19c61d33fa3bde271df09e99936022": [], "9b": [], "9b2eab7833494e7c82f70c9b2f8e907d38231f4535704e3045a8a4960c8": [], "9c": [], "9cf1a409640adac045750b2ba9d1355c83942fbae74f21284c2133292be": [], "9eb14d4e9ef366be2020063d91c4f608294969fcd7b9fcc48153c64b9776": [], "9f1413bef53171f379d786aabc104d4abeea48ee84c553a3e3d8c9f96a9c": [], "9f1894efa1bb15e98613244b24dfbacfe2309e0ac3cfc27d4c608c2270d2": [], "9k": 5, "A": [0, 4, 5, 6, 8, 9, 11, 12, 17, 19, 21, 24, 26], "AND": 27, "AT": [], "And": [4, 11, 15], "As": [2, 3, 4, 6, 7, 8, 11, 14, 17, 19, 20, 22, 25, 27], "At": [8, 28], "BY": 5, "Being": [7, 20], "But": [4, 19, 21, 24], "By": [13, 24, 26, 28], "For": [2, 4, 6, 7, 8, 9, 11, 16, 19, 21, 22, 23, 24, 25, 27, 28], "If": [8, 9, 10, 11, 17, 19, 22], "In": [0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 21, 22, 23, 24, 25, 27, 28], "It": [0, 4, 8, 18, 19, 21, 22, 24, 26], "Its": [4, 19], "NOT": 27, "No": [4, 10, 16, 24], "OR": 27, "Of": 8, "On": 2, "One": [8, 9, 11, 12, 28], "Or": [0, 4], "That": 21, "The": [0, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 22, 23, 24, 25, 26, 27, 28], "Their": [25, 28], "Then": [11, 13], "There": [2, 11, 20, 21, 23], "These": [2, 4, 5, 6, 8, 11, 12, 13, 17, 19, 21, 23, 27, 28], "To": [2, 4, 6, 10, 11, 12, 15, 19, 23, 28], "Will": [], "With": [4, 15, 19], "_": [8, 11, 22, 24, 28], "_0": [8, 11], "_1": 26, "_2": 26, "__getitem__": [4, 24], "__init__": [4, 24], "__len__": [4, 24], "_brownian": 10, "_c": [], "_end": 10, "_get_default_devic": [], "_i": 8, "_n": 26, "_q": 26, "_t": [8, 11], "a1": [], "a2": [], "a2t_project": 4, "a3": [], "a39c835871caca0173f526e321336a1a2b0961e38bf9b71b7213b651e3c8": [], "a4": [], "a5": [], "a6": [], "a61ef6f7faf98edadf4ce8094873d298f8582a3ec59b65c9174c516926e8": [], "a6c031bc1590789a3da14bd6a9cccc46c932401765d6d8f37e75c8214b44": [], "a7": [], "a8": [], "a812df4e2dd5696d1f351d58b8fe16a405b234ad2886a0dab9183fb78109": [], "a_i": 28, "aa": [], "aaa": [0, 18], 
"aaai": [0, 15], "aaron": [0, 17, 28], "ab": [0, 6, 8, 10], "ab44c871b0f07f491e5d2ad12c9bd7358e527510618cb1b803a88e986db1": [], "abbeel": 0, "aberman": [], "abhimanyu": [], "abhinav": [], "abhishek": 0, "abi3": [], "abil": [0, 8, 11, 12, 16, 17, 28], "abl": [7, 11, 12, 21, 23], "ablat": 12, "about": [0, 2, 3, 4, 7, 8, 9, 11, 16, 19, 21, 22, 23, 26, 28], "abov": [8, 11, 13, 14, 16, 19, 21], "abraham": [], "absent": 23, "absl": [], "absl_pi": [], "abstract": [0, 3, 6, 7, 8, 9, 22], "abu": [0, 8, 9], "academ": [3, 15], "acceler": [0, 15], "access": [3, 15], "accompani": [0, 2, 4, 5, 9, 13, 24], "account": [4, 6, 10, 24], "accur": [2, 12, 13, 19, 28], "accuraci": [19, 26, 28], "achiam": [0, 18], "achiev": [3, 4, 8, 9, 13, 19], "acl": [0, 6, 8], "aclanthologi": [0, 5, 6, 8, 9], "acm": [0, 18, 25], "acoust": [0, 5, 8, 9, 16, 18, 23, 24, 28], "acquir": [], "across": [2, 13, 15, 19, 23, 25, 26, 28], "activ": [0, 8, 11, 15, 27], "actual": [11, 16, 23, 26], "ad": [6, 11, 17, 19], "adaln": 11, "adam": [0, 5, 17, 18], "adamw": [4, 24], "adapt": [9, 11, 12, 13, 19, 28], "adb": [0, 5, 13, 18], "add": [4, 11, 24], "add_special_token": 4, "addit": [2, 8, 9, 11, 12, 15, 19, 21, 23], "addition": [2, 12, 13, 16, 21, 25, 27, 28], "address": [2, 3, 8, 9, 18, 19, 20, 23, 25, 27, 28], "adi": [0, 18], "aditya": [0, 28], "adjust": 19, "adler": [0, 18], "admiss": 6, "adob": 15, "adobephotoshopsenseiarteam": [], "adopt": [6, 8, 11], "adpt": [], "advanc": [0, 3, 5, 7, 8, 12, 13, 15, 16, 17, 18, 20, 21, 23, 28], "advantag": [3, 9, 16, 17, 25, 28], "adversari": [0, 14, 17], "advis": 15, "ae": [], "ae30dadffc90b9006d77af76b393cb9dfbfc9629f339fc1574a1c52e6806": [], "aed7a284c00dfa7c0682d14df85ad4955a350a21d2e3b06d8240497359bf": [], "aeiou": [], "aesthet": [0, 9], "af": [], "af0d1f58f86002be0cf1e2665cdd6f7a4a71cdc8a7a9438cdc9e3b5375f": [], "affect": [10, 21], "after": [8, 10, 11, 19], "afternoon": 5, "again": 21, "against": [12, 26], "agarw": [0, 18, 28], "agent": [19, 20], "aggreg": [0, 6, 9], 
"aggress": 16, "agostinelli": [0, 5, 18], "agrawala": [], "ahead": 2, "ahm": [], "ahmad": [0, 18], "ai": [0, 8, 10, 15, 16, 20, 22, 24], "ai4cc": [], "aidan": 0, "aiesha": [], "aila": [], "aim": [3, 13, 17], "aiobotocor": [], "aiofil": [], "aiohappyeyebal": [], "aiohttp": [], "aioitertool": [], "aiosign": [], "air": [0, 6], "aiti": [0, 6], "aittala": [], "ajai": 0, "ajit": [], "aka": 11, "akash": [], "akhgari": [], "akhil": [], "akkaya": [0, 18], "aksan": [], "akten": [], "al": [4, 8, 9, 24, 25], "alaluf": [], "alan": [], "alban": [], "albert": 0, "album": 5, "alcap": [0, 8, 9], "alec": [0, 18, 28], "alejandro": 0, "alek": [], "aleksand": [], "aleman": [0, 18], "alex": [0, 17, 18], "alexand": [0, 8, 27], "alexandr": [0, 18], "alexei": [], "algorithm": [0, 10, 13], "ali": [], "alia": [], "alias_free_torch": [], "align": [0, 2, 6, 8, 9, 13, 21, 23, 28], "all": [0, 2, 3, 6, 8, 9, 11, 12, 15, 19, 21, 26, 27], "allow": [2, 8, 9, 11, 15, 16, 19, 24, 28], "allud": 19, "almeida": [0, 18], "almost": [11, 13, 21, 23], "alon": 0, "along": [5, 7, 11, 27], "alongsid": [6, 8, 9], "alpha": [], "alphabet": 20, "alreadi": [10, 21, 28], "also": [2, 3, 6, 7, 8, 9, 11, 12, 13, 15, 16, 19, 20, 21, 23, 25, 28], "altenschmidt": [0, 18], "altern": [5, 8, 17, 18, 19, 21], "although": [8, 16, 19, 21], "altman": [0, 18], "alwai": [6, 8, 19], "amanda": [0, 28], "amaz": 24, "amazon": 5, "ambient": 28, "ambuj": [], "america": [], "american": [0, 23], "ami": [], "amir": [], "amirmojtaba": [], "amit": [0, 5, 8, 9], "amodei": [0, 18], "among": [5, 6, 8], "amount": 19, "amp": 10, "amplitud": 19, "amu": [7, 9], "an": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28], "anaconda3": 10, "anadkat": [0, 18], "analogi": [0, 5], "analys": 7, "analysi": [0, 23, 25, 28], "analyt": 23, "analyz": [3, 26, 27], "anandkumar": [], "anchor": 28, "and82": [0, 11], "anderson": 0, "anderson2016": [], "andi": [], "andr": [0, 6, 8], "andrea": [0, 5, 18], "andrew": [0, 17, 
18], "andrii": [], "angela": [], "angelo": 0, "anger": 27, "ani": [2, 3, 4, 10, 11, 19, 21, 22], "anil": [], "anima": [], "animesh": [], "anirudh": [], "anjali": [], "ann": 15, "anna": 0, "annot": [0, 5, 7, 16, 18, 23, 27, 28], "annotated_typ": [], "annual": [0, 6, 8], "anoth": [4, 8, 19, 21], "ansel": [], "answer": [0, 3, 5, 6, 8, 16, 19, 22], "anthem": 24, "anticipatori": [0, 13], "antoin": [0, 5, 18], "antonio": [], "anygpt": 8, "anyi": [], "anyio": [], "anyon": 4, "anyth": [2, 11, 19, 20, 21, 22], "anytorch": [], "aouameur": 0, "ap": 26, "apach": 5, "apart": 28, "api": [0, 10, 19], "appdir": [], "appear": [8, 21, 22], "append": [11, 24], "appl": 15, "appli": [5, 12, 13, 15, 19, 21], "applic": [0, 3, 7, 9, 15, 17, 18, 19, 20, 21, 22, 27], "appreci": 4, "approach": [0, 2, 3, 6, 8, 9, 11, 13, 15, 17, 19, 21, 22, 23, 24, 25, 27, 28], "appropri": [12, 17, 25, 26, 27], "approx": 11, "approxim": 11, "ar": [0, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 27, 28], "arab": [0, 8, 9], "arang": 24, "arash": [], "arbitrari": [21, 28], "arbor": 15, "architectur": [2, 3, 4, 7, 12, 13, 14, 16, 21, 22], "area": [12, 13, 15, 17, 19, 22, 25], "aren": [0, 5, 18, 19, 28], "argbind": [], "argpars": [], "aris": 3, "armi": [0, 6, 8], "around": [4, 8, 10], "arrai": [4, 24], "arrang": 28, "arriv": 19, "art": [0, 2, 8, 9, 13], "arthur": 0, "articl": 19, "articul": 13, "artifici": [0, 8, 15], "artist": [2, 13, 17, 23, 27, 28], "artsiom": [], "arun": [0, 18, 25], "arushi": [0, 8], "arxiv": [0, 5, 6, 8, 9, 17, 18, 23, 25, 28], "ashish": 0, "ask": 19, "askel": [0, 28], "aspect": [8, 9, 12, 23, 28], "assess": [0, 6, 12, 26, 27], "assign": [9, 12, 13, 21, 28], "assist": 19, "associ": [0, 5, 6, 8, 9, 18, 27, 28], "assum": 8, "ast": 28, "asttoken": [], "atin": [0, 5, 8], "attempt": [8, 17, 23, 27], "attend": [19, 28], "attent": [0, 2, 4, 8, 11, 14, 19, 28], "attention_mask": [4, 24], "attr": [], "attribut": [16, 17, 18, 23, 27], "atzmon": [], "audio": [0, 2, 
3, 4, 5, 6, 7, 8, 9, 11, 13, 17, 18, 19, 20, 23, 24, 27], "audio_2023": [], "audio_base64": 5, "audio_byt": 5, "audio_embedding_dim": [4, 24], "audio_forward": 24, "audio_html": 5, "audio_project": 24, "audio_sampl": 10, "audiobench": [0, 6], "audiogen": [0, 13], "audioldm": [0, 13], "audiolm": [], "audioread": [], "audioset": 5, "audiotool": [], "audit": [], "auditori": 18, "augment": [0, 5, 8, 9, 11], "august": [0, 6, 8], "auraloss": [], "authent": 10, "author": 9, "auto": [0, 9, 14, 18], "autocast_mod": 10, "autoencod": [0, 11, 17, 19], "autom": [17, 20], "automat": [0, 6, 7, 9, 15, 17], "automodel": [4, 24], "autonom": 20, "autoregress": [8, 11, 13, 19, 22, 28], "autoregresst": 13, "autosav": 10, "autotoken": 24, "auxiliari": 5, "av": [], "avaiabl": 10, "avail": [4, 10, 17, 19, 24], "avent": 0, "avenu": 2, "averag": [6, 12, 25, 26], "avoid": 10, "aw": [0, 6], "awai": [5, 6, 11, 24], "awar": 9, "ax": [], "axel": [], "axi": [7, 11, 16], "ayan": [], "ayh": [], "azalea": [], "b": [0, 10, 19], "b1": [], "b161908e2f51be56568184aeb4a880fd287178d176fd1c860d2217f41106": [], "b2": [], "b3": [], "b4": [], "b6": [], "b64encod": 5, "b67ebd7e19ffe259f05d3cf4547326725c3113d640c277030be3e9998d6f": [], "b7": [], "b8": [], "b86984bed139586d01532a587464b5805f12e397594f19f931c4c2fbfa61": [], "b9": [], "b95df0b8593aee5d9e68b9a9f24e83c69657afb46b24f83b57098d926401": [], "b9b800c45527aadd64d5b442f9b932b00648617eb5d63d2c7a6587b7cafc": [], "ba": [], "ba44652d562cbf0bf320e0f3810206149c8a4e99cdbf66da82e97ab53a15": [], "bach": [0, 13, 17, 18], "back": [10, 11, 13, 19, 20, 23, 27], "backbon": [8, 11], "background": 23, "backpropag": [], "backward": [4, 24], "bad": 2, "badlani": [0, 8], "bahjat": [], "bahri": [], "bai": [], "baid": [], "balaji": [], "balanc": [8, 26], "balog": [0, 18, 25], "bangkok": [0, 6, 8], "banjo": 24, "bao": [], "bar": [4, 11], "barn": [], "barret": [0, 18], "barrett": [], "barrington": [0, 17, 18, 27], "barron": [], "bart": 8, "barzilai": [], "base": [0, 3, 5, 7, 8, 
9, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28], "base64": 5, "baselin": [9, 19, 28], "bash": 10, "basi": 5, "basic": [4, 17, 18, 19, 21, 22, 24], "bass": 28, "batch": [4, 10, 24, 28], "batch_siz": [4, 24], "bay": [], "bb9ff095ae7b1b6908480f683b6ca6b71c2105d343a5e5cb25334b01f5fa": [], "bc": [], "bd": [], "bdt": [], "be958fefa589186b54daaa9a72fa1a2e19e42a2dcab87ee15c8273259da0": [], "beach": 28, "beat": [0, 5, 18, 24], "beatl": 28, "beauti": 4, "becaus": [11, 12, 16, 17, 19, 20, 21, 22, 24, 26], "becom": [2, 8, 9, 13, 16, 19, 20, 21, 27], "beeler": [], "been": [2, 9, 10, 11, 12, 13, 15, 19, 21, 22, 23, 27], "befor": [10, 11, 12, 19, 27], "began": 13, "begin": [18, 21, 24, 28], "behav": 19, "behavior": [4, 19, 23, 24], "behind": [12, 19], "being": [3, 9, 11, 16, 19, 21, 23, 26, 28], "believ": 15, "bell": [], "below": [5, 6, 8, 10, 11, 19, 21, 26, 28], "belt": 4, "ben": 0, "bench": [0, 6], "benchmark": 0, "benefit": [7, 9, 15], "beneto": [0, 5, 6, 8, 9, 15, 18, 23, 28], "bengio": [0, 17], "benjamin": [0, 8], "benno": [0, 5, 6, 8, 23, 28], "benzi": [], "berard": [], "berg": [0, 8, 15, 18, 28], "bergman": [], "bermano": [], "bernard": [], "bernhard": 0, "bert": [0, 6, 11, 18, 21, 22, 24, 28], "bertin": [0, 17], "bespok": 2, "best": [2, 3, 11, 15, 19, 27, 28], "beta": [], "beta_": 8, "bethard": [0, 5, 8, 9], "better": [4, 6, 8, 9, 11, 19, 20, 22, 23, 24, 25, 26, 28], "between": [3, 6, 7, 8, 9, 12, 13, 15, 17, 19, 20, 23, 25, 26, 27, 28], "beyond": [0, 9, 17, 18, 19, 25, 26], "bhe23": [], "bi": 28, "bia": [8, 20, 24], "bian": [], "bias": [12, 20], "bichen": [], "bidirect": [0, 11, 18, 28], "big": [11, 19], "bigger": [19, 20], "biggest": 20, "bigvgan": [], "bilei": [], "billion": 19, "bin": [0, 6], "binari": [26, 27], "bing": [], "bingchen": [], "biomed": [], "bit": [8, 10, 11], "bittner": [0, 8, 9, 18], "bj": [], "bjd": [], "black": [], "blank": [19, 21, 22], "blap": [0, 8], "blattmann": [], "bleach": [], "blend": [4, 13, 24], "bleu": 6, "bleu_1": 
6, "blob": 10, "block": [8, 10, 11, 14], "blocker": 20, "blog": [0, 8, 11, 18], "blown": 24, "blue": [5, 13, 16], "blurri": 19, "bmv": [], "bnh": [], "bo": [], "bockkschlut": [], "bockkw16": [], "bodganov": [0, 5, 23], "bodi": 2, "boesel": [], "bogdanov": [0, 6, 8], "bohan": [], "boissier": [], "bokeh": [], "boldsymbol": [8, 11], "bolei": [], "book": [3, 4, 15], "booktitl": [], "boolean": [18, 27], "boost": 2, "bootstrap": [0, 8], "borgeaud": [], "bori": 0, "borrow": 6, "borso": [0, 5, 18], "bos_embed": 4, "bos_token_id": 4, "bosma": [0, 18], "bot": 20, "both": [2, 4, 6, 8, 9, 11, 12, 19, 25, 26, 27, 28], "botocor": [], "bottleneck": [13, 14], "bottom": 21, "boyer": [0, 9], "bpe": [21, 28], "bpm": 10, "braceexpand": [], "brahma": [0, 18], "bram": [], "brandon": [0, 9], "brass": 5, "braun": [], "break": [0, 2, 4, 6, 8, 11, 24, 28], "breakthrough": 13, "breathtak": 4, "brebisson": [], "bresson": [], "breviti": 6, "brian": [0, 17, 18, 27], "bridg": [0, 3, 5, 8, 9, 13, 18, 27, 28], "briefli": [6, 16, 19], "bright": 16, "bring": 19, "broad": [2, 11, 13, 22], "broadcast": 21, "broader": [23, 27], "brockman": [], "broken": 8, "brook": [], "broomel": [], "brownian_interv": 10, "brows": 27, "browser": [10, 24], "brox": [], "brualla": [], "bruno": 0, "bryan": [0, 5, 8, 18], "bsv": [], "btyld23": [], "budget": [8, 19], "build": [2, 11, 15, 23, 25, 28], "built": [11, 24, 27, 28], "bulid": 24, "bunch": 19, "burcu": [], "burgeon": 15, "burovski": [], "byte": [5, 21, 28], "bytecod": [], "byted": 15, "c": [0, 6, 8, 11, 16, 17, 19], "c1": [], "c13ea695a4393639830bf96baea956538ba7a9d06fcce7cef10bfff20f72": [], "c188ac517f402775b90d6f312955a5e53b866c964b32119f2ed76315697": [], "c19819d5e3d95294a6f5947fb9b9629efb316b96de511b418c53d245aae6": [], "c2": [], "c316262244abea7481f95f1e91d7575f3dfcf6455d56d1ffe9839c582eb1": [], "c4": [], "c463dc5fc02fbe019566d067a9d18746cd3c664f29c9b8b3c3f9ed025365": [], "c4dm": [], "c5": [], "c6": [], 
"c691e6c5d925a364d63eec27d1f10477ca7902febe10a8e1f86284dba754": [], "c869a1fbd481dcb02c70032fd6a7243de7582bc48c7cae03d6f0985a11c0": [], "c8bfa8cbcd3ea1d25d2beb359b5c5a3f4339a7e2e5d9e3ef3e29ba3ab3b9": [], "c9b96572ab7994e73c64588f8875741823f2daba70e746547fff9a2d9a54": [], "ca": 15, "cacer": 0, "cach": [], "cacul": 12, "cai": [], "caillon": [0, 5, 18], "calcul": [12, 19, 21, 26], "california": [15, 28], "call": [4, 8, 10, 11, 16, 21, 22, 24, 25], "cambridg": [], "came": 28, "campaign": 20, "can": [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 16, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28], "cancel": [], "candid": [6, 15], "cangea": [], "cannot": [9, 11, 17, 23, 27, 28], "cao": 0, "cap": [4, 24], "capabl": [0, 3, 8, 13, 15, 17, 19, 23, 25, 28], "capac": 4, "capit": 11, "caption": [0, 2, 4, 5, 6, 7, 8, 11, 15, 16, 18, 24, 28], "caption2emb": 24, "captiosn": [], "captiv": [4, 24], "captur": [4, 6, 7, 8, 9, 17, 22, 23, 24, 25, 28], "carbonneau": 0, "care": [8, 23, 25, 26], "carefulli": [20, 26, 28], "carlo": [], "carnovalini": [], "carol": [0, 5, 8, 9], "carr": 0, "carri": 8, "carrol": [0, 18], "casagrand": [], "cascad": 17, "case": [2, 5, 7, 8, 9, 10, 11, 12, 21, 23], "caseb": 0, "casei": [], "cast": 8, "casual": 5, "cat": [4, 16, 24], "catalog": 27, "catanzaro": [0, 8], "catchi": 24, "categor": [9, 21], "categori": [2, 9, 10, 11, 26, 28], "cater": 15, "caus": 11, "causal": [21, 28], "cb": [], "cc": [4, 5, 24], "cc3a402a6439c15c3d4294333e13042b915bbeab54edc457c723931fed3f": [], "ccf007edf442c3c0cd3a98be2c82bc99edc957c04436a759b6e1e01077e0": [], "cck": [], "cd": 15, "cd10c82398f3b39bbf60a300e09c931bdf6844f3f2fba9ab2b5981501f9f": [], "cdescrivan17": [], "cdot": [11, 28], "cdz": [], "ce": 0, "ce21": [0, 2], "ce6964e9f8822f6e63ebc59bdcc5ae445126b7356da63188fa0e6265054": [], "cell": [4, 10, 16, 24], "celma": [0, 17], "celso": [], "cem": [], "center": 4, "centr": 15, "central": [19, 26], "certain": [19, 21, 22], "certainli": 11, "certifi": [], "cf": [], "cffi": [], "cfg": 10, 
"cfg_scale": 10, "cfs16": [0, 8], "cfsc17": [0, 9], "chaganti": [0, 18, 25], "chakrabarti": [], "challeng": [3, 6, 7, 13, 15, 17, 18, 19, 22, 27, 28], "cham": [], "chan": [], "chanan": [], "chang": [0, 2, 6, 8, 18, 19, 22, 28], "changli": [0, 8], "changx": [], "changyou": [0, 8, 9], "channel": [2, 11, 17, 19], "chao": [0, 8], "chaowei": [], "chaoyu": [], "chapter": [3, 16, 18, 22, 23, 25], "character": [], "characterist": [4, 6, 8, 9, 23, 27, 28], "charli": [], "charset": [], "chartmetr": 15, "chat": [22, 25], "chatgpt": [16, 19, 22], "chatthe": [], "chaudhuri": [], "chauhan": [], "che23": [], "cheaper": 21, "cheapli": 19, "cheat": 12, "chechik": [], "check": [4, 10, 11, 16, 24], "chelsea": [], "chemistri": [], "chen": [0, 5, 6, 8, 9, 15, 18, 23, 28], "chenchong": [0, 8], "cheng": 0, "chenji": 0, "chenlin": [], "chenshuo": [0, 5, 8], "chenyang": [], "chet": 0, "cheung": 0, "chia": 0, "chiang": 0, "chieh": 0, "child": [0, 18], "chinchilla": 19, "ching": [], "chintala": [], "chitwan": [], "chiu": [], "chiyuan": [], "chl": [0, 18], "cho": [0, 9], "cho_unifying_2021": [], "choi": [0, 8, 9, 17, 18, 23, 25, 28], "choic": [4, 6, 8, 19, 24, 28], "chong": [0, 18], "chongxuan": [], "choos": 8, "choppi": 5, "choral": [0, 13], "chord": [2, 5], "choru": 2, "chosen": 9, "chou": [0, 17, 18], "chourdia": [], "chri": [0, 17, 18, 28], "christian": [], "christin": [0, 18], "christina": [], "christoph": 0, "chronolog": 11, "chu": [0, 6, 8], "chu_qwen": [], "chul": [], "chun": [], "chung": [0, 18], "cider": 6, "cinjon": [0, 17], "circul": 2, "circumv": 11, "cite": [], "citep": [], "cites": [], "citi": [0, 5, 8, 9], "cj": 0, "ck": [], "ckg": [0, 13, 14, 18], "ckm": [], "ckp": [], "clamp": [10, 24], "clap": [11, 12], "clariti": 11, "clark": [0, 28], "class": [2, 16, 21], "classic": [4, 5, 11, 16, 24, 27], "classif": [0, 3, 7, 12, 17, 18, 19, 27, 28], "classifi": [9, 10, 27], "claud": 19, "clean": [10, 11], "clean_fid": [], "clean_up_tokenization_spac": [4, 24], "cleaner": 11, "cleanli": 
11, "clear": [6, 8, 9, 11], "clever": [11, 19], "clich\u00e9": 13, "click": [], "client": [], "clip": [4, 8, 9, 10, 15, 16, 19], "clip_anytorch": [], "clone": [4, 15], "close": [6, 8, 13, 19, 28], "closer": [5, 28], "closest": 17, "cloud": 8, "clpn19": [0, 18, 28], "cluster": [], "clz": [0, 18, 25], "cn24": [], "cnn": 8, "co": [0, 8, 10, 12, 15, 28], "coars": [2, 9], "coca": [0, 16], "code": [0, 3, 11, 15, 18, 19, 28], "codec": [11, 12, 18], "coeffici": 11, "cohen": [], "coher": 25, "col10": [], "col12": [], "colab": [4, 24], "colin": [0, 18], "collab": 10, "collabor": 28, "collect": [0, 7, 17, 19, 20, 27], "colleg": 15, "collin": [], "colloqui": 23, "color": 16, "colorcet": [], "com": [0, 4, 10, 15, 24], "combin": [3, 6, 8, 11, 12, 14, 15, 24, 27, 28], "come": [6, 8, 9, 11, 19, 20], "comm": [], "command": 10, "common": [4, 6, 7, 8, 19, 21, 28], "commonli": [6, 12], "commun": [2, 15, 16, 17, 28], "compani": 19, "companion": [], "compar": [6, 7, 12, 13, 16, 17, 18, 19, 21, 25, 26, 27, 28], "comparison": [6, 19, 26], "compat": [], "compil": [], "complement": 12, "complementari": 26, "complet": [3, 11, 15, 19, 23, 24, 26, 28], "complex": [3, 7, 8, 9, 11, 12, 13, 16, 17, 18, 19, 20, 21, 27, 28], "compon": [3, 4, 6, 8, 18, 19, 21, 25, 28], "compos": [2, 8, 13, 25], "composit": [0, 13, 15], "comprehend": [0, 8, 15], "comprehens": [0, 3, 6, 12, 15, 26, 27, 28], "compress": [0, 11], "compris": [], "comput": [0, 5, 6, 8, 9, 11, 12, 13, 15, 17, 18, 19, 21, 26, 27, 28], "computation": [2, 21], "concaten": [2, 8, 11, 14], "concept": [4, 9, 11, 13, 16, 18, 19, 24, 26, 28], "conceptu": 11, "concern": [11, 19], "conclud": [3, 11], "conda": [10, 15], "condens": 11, "condit": [0, 2, 3, 4, 9, 10, 14, 17, 18, 19, 22], "conduct": [3, 23], "confer": [0, 5, 6, 8, 9, 15, 17, 18, 25, 28], "confid": 12, "config": [], "configpars": [], "configur": [], "cong": [], "congratul": [3, 24], "connect": [8, 17, 21, 24, 28], "connectionist": 0, "connelli": [], "consecut": 22, "consensu": 6, 
"consequ": [8, 28], "consid": [4, 6, 9, 12, 20, 22, 23, 26, 28], "consider": [2, 26], "consist": [0, 4, 5, 8, 13, 14, 15, 17, 28], "consistut": [], "consolid": 8, "constabl": [], "constant": [5, 19], "constantin": [0, 8], "constitu": 28, "constrain": [9, 28], "constraint": 13, "construct": [4, 28], "consum": 12, "consumpt": 20, "contain": [2, 5, 9, 15, 23, 25, 27], "contemporari": [4, 24], "content": [0, 2, 3, 4, 5, 7, 9, 15, 17, 23, 24, 27, 28], "context": [6, 8, 11, 20, 21, 23, 25, 28], "contextu": [6, 16, 19, 21, 22, 23, 25, 28], "contigu": 4, "continu": [3, 6, 12, 14, 19, 21, 22], "contourpi": [], "contrast": [0, 3, 12, 16, 18, 19, 24, 28], "contribut": [13, 15], "contributor": [], "control": [0, 5, 11, 13, 15, 17, 18, 19, 20, 22, 28], "controlnet": [0, 2, 18], "convei": [4, 9, 13], "conveni": 6, "convent": [15, 25], "converg": 0, "convers": [0, 3, 6, 8, 12, 15, 18, 22, 23], "convert": [10, 11, 14, 18, 28], "convert_tokens_to_id": 4, "convolut": [0, 9, 11, 14], "cooijman": [], "cook": [0, 9], "copet": [0, 18], "copi": 4, "copyright": 5, "core": [5, 8, 11, 28], "corner": 12, "corpora": [16, 28], "corpu": [0, 5, 21, 22, 23], "corpusid": 0, "corr": 0, "correct": [4, 11, 19], "correctli": [19, 26], "correl": [2, 6], "correspond": [9, 11, 12, 19, 21], "corrupt": [11, 28], "cosin": [6, 12, 26], "cosmo": 0, "cost": [11, 12, 19, 20, 27], "costli": [], "cot": 19, "could": [11, 17, 21, 23, 25, 27, 28], "couldn": [], "count": [8, 22, 26], "countri": 24, "coupl": 7, "cours": [8, 19], "courvil": [0, 17], "cover": [7, 9, 15, 16, 17, 21, 22, 23, 27], "coverag": [23, 27], "cp26": [], "cp27": [], "cp311": [], "cp32": [], "cp33": [], "cp34": [], "cp35": [], "cp36": [], "cp37": [], "cpcd": 25, "cpjku": [], "cpp": [], "cpu": [4, 10, 24], "cqt": [], "crawl": 5, "creat": [2, 3, 5, 6, 7, 12, 15, 17, 19, 23, 25, 26, 27, 28], "create_audio_html": 5, "creation": [0, 13, 15, 18], "creativ": [0, 2, 8, 13, 17], "cref": [], "criteria": [12, 23, 26, 27], "criterion": 4, "critic": [20, 26, 
27, 28], "crop": 2, "cross": [2, 8, 11, 14, 19, 28], "cross_entropi": [4, 24], "crossentropyloss": 4, "crowdsourc": 5, "crucial": [9, 12, 26, 28], "csl": 13, "csrc": [], "cuda": [4, 10, 24], "cue": [13, 18], "cultur": [20, 23, 28], "cun": [], "curat": [0, 18, 25], "current": [2, 3, 6, 8, 9, 10, 13, 18, 21, 23, 25], "curti": 0, "curv": 11, "custom": [2, 5, 19, 24], "cut": 18, "cutoff": [20, 26], "cvf": [0, 5], "cvpr": [0, 5, 15], "cvpr52688": 0, "cvpr52729": [0, 5], "cvsf23": [], "cwbergkirkpatrickd20": [0, 13], "cwbkd20": [], "cwl": [0, 11, 13, 18], "cxh": [], "cxz": [0, 8, 28], "cxzg16": [], "cyclegan": [], "cycler": [], "cyran": 0, "cyril": [0, 17], "czj": [0, 13], "d": [0, 4, 10, 11, 15, 18, 19, 24], "d1": [], "d110f0a43beb365758a252203c43eaaad169fe7749da918869a8c991f726": [], "d1e337b9b4c8ea3aae5d399ace8c9cf4c2a7789cfe9d14766511fbc83c8b": [], "d2": [], "d23a97e0a2c690d40b165d1062e2c4ccc796be458a1ce59f6ba030434663": [], "d2805324fb746d8da86d3844bee4f55c0cfd6c136de61b713772d44c5bea": [], "d3": [], "d4": [], "d497a310bde3f01cb805196ac61b7ad6dc5dcf8dce66634dc34364b20b4f": [], "d5": [], "d78dc063216e62fc55f6b2eebb447f6a4b0a59f55c8406376f76bf959b08": [], "d8": [], "d9": [], "d_": 8, "d_c": 11, "d_h": 11, "d_k": [], "d_t": 11, "d_w": 11, "da": [], "dabeaf902892922777492e1d253bb7e1264cadce3cea932f7ff599e53fea": [], "dac": 11, "dacheng": [], "daeyong": [0, 23, 25], "dahl": [], "dai": [0, 5, 8, 9, 18, 28], "daiq": [], "dall": 19, "damien": [], "dan": [], "danc": [2, 5, 16], "danceabl": 5, "dang": [], "daniel": [0, 5, 8, 9, 18, 28], "danilo": 0, "dannenberg": [], "dao": [], "dao23": [], "dara": [], "dario": [0, 18], "dark": 16, "dasaem": [0, 28], "data": [0, 3, 5, 7, 8, 9, 11, 12, 13, 16, 17, 19, 20, 21, 23, 25, 27], "databas": [17, 19, 24, 26, 27], "datafram": 5, "dataload": [4, 24], "dataset": [0, 2, 6, 7, 8, 9, 15, 16, 17, 18, 19, 23, 25, 27, 28], "date": [13, 19, 23], "dateutil": [], "daunt": 20, "davi": [], "david": [0, 17, 18, 27], "dawen": [], "dazhong": [], "db": 
24, "db99aa669eee301966bc6c997d60a0240f9cecae63f044b2e5a5310e4bf7": [], "dbvb17": [], "dc39062efec7515add304b98a54da2948709a808": [], "dcd": [0, 13], "dck": [0, 23, 25], "dcln23": [0, 8, 18], "dclt18": [0, 18, 28], "dcr": [0, 2], "dcsa22": [0, 11], "dctorch": [], "dd": [], "ddp09": [], "ddpm": [0, 2], "ddsp": [0, 13], "de": 11, "de3276d773ab6ce3ad676df5fab5aac19696b2956319d65d7dd88fb10f19": [], "deadlock": 10, "deaf": 7, "deal": [8, 9, 11, 21, 27], "decemb": [0, 8, 9], "decid": 19, "decis": 19, "decod": [3, 4, 5, 11, 14, 17, 18, 24], "decompos": 21, "deconvolut": 14, "decor": [], "decreas": 19, "dedic": [10, 11], "deep": [0, 4, 7, 8, 9, 12, 13, 15, 17, 18, 22, 24, 28], "deepak": [], "deepanwai": [0, 5], "deepbach": [0, 13], "deeper": [11, 15, 19, 25], "deepfak": 20, "deepli": 17, "deepmind": 15, "def": [4, 5, 24], "default": [4, 10, 24], "defferrard": [], "defin": [2, 4, 8, 10, 11, 18, 21, 22, 24, 26], "definit": [8, 18, 22], "defossezcsa23": [0, 14], "degara": [], "degre": [], "dehghani": [0, 18], "dekel": [], "delet": 10, "delic": 4, "delight": 3, "deliv": [4, 24], "delta": 28, "delv": [15, 18], "demo": [0, 6, 8, 10], "demonstr": [3, 14, 16, 19, 25, 28], "den": [0, 17, 28], "deng": [0, 5, 8, 9], "dengsheng": [0, 17], "denk": [0, 5, 18], "denois": 0, "denot": [8, 9, 11, 22, 28], "dens": [8, 28], "densiti": [2, 11], "denton": [], "depart": 15, "departur": 15, "depend": [6, 8, 9, 10, 11, 12, 15, 19, 21, 22, 28], "deploi": [], "depract": [4, 24], "depth": [3, 18], "deriv": [5, 11, 12, 15], "desc": [4, 24], "descent": 22, "describ": [0, 2, 4, 5, 7, 8, 9, 16, 23, 27], "descript": [0, 3, 6, 13, 15, 16, 19, 23, 24, 27, 28], "descript_audio_codec": [], "descript_audiotool": [], "description_evalu": [], "description_model": [], "description_models_t": [], "description_task": [], "descriptor": 9, "deserv": [], "deshmukh": [0, 8], "desideatum": 12, "design": [2, 4, 6, 7, 8, 9, 10, 11, 12, 23, 28], "desir": [17, 19], "desktop": 19, "desmaison": [], "despit": [17, 21], 
"dessert": 3, "desw23": [0, 8], "detach": [4, 24], "detail": [2, 4, 6, 8, 9, 12, 13, 15, 16, 18, 21, 22, 24, 28], "detect": [20, 27], "determin": [11, 19, 26], "develop": [0, 7, 8, 9, 12, 13, 15, 16, 17, 18, 19, 22, 25, 27, 28], "devi": 0, "devic": [4, 10, 24], "device_typ": 10, "devin": [], "devis": 8, "devito": [], "devlin": [0, 18, 28], "df": 5, "df18d492a8f00d29a30db307904b9b296e37507034eedb523876f3a2e13": [], "df4b9b42f2be0b623cbd5e2140cafcaa2bef0759a00b7b70104dcfe2fb51": [], "df630c387a0a054815d60be6a97eb4e8f17385d5d6fe660e1c02750062b4": [], "dhabi": [0, 8, 9], "dhariw": [0, 18], "dhyy18": [0, 13], "di": 0, "dialog": 18, "dialogu": [0, 6, 7, 8, 9, 15, 23, 25], "dickstein": 0, "dict": [], "did": [8, 21], "diederik": 0, "diego": 15, "dieleman": [0, 11, 17], "diff": [0, 2], "differ": [4, 5, 6, 7, 8, 9, 11, 12, 14, 17, 23, 24, 26, 27, 28], "differenti": [0, 2, 8, 11], "differnt": [], "difficult": 19, "difficulti": [16, 19], "diffus": [0, 2, 3, 10, 13, 16, 18, 19], "diffwav": [], "dig": 22, "digit": 0, "dim": [4, 24], "dimens": [11, 21, 23], "dimension": [11, 28], "dimitra": [], "dinculescu": 0, "ding": [], "dinh": [], "diogo": [0, 18], "direct": [3, 7, 8, 11, 15, 19, 23, 28], "directli": [2, 5, 6, 8, 10, 11, 13, 19], "disabl": 10, "disadvantag": 3, "discard": 23, "discount": 6, "discov": [23, 25, 27], "discoveri": [0, 15, 23, 25], "discret": [0, 2, 3, 8, 11, 14, 18, 19, 21], "discrimin": [11, 14, 17, 19, 28], "discuss": [2, 3, 7, 8, 9, 11, 12, 15, 18, 19, 20, 25, 28], "dispatch": 19, "displai": [4, 5, 10, 24], "dispos": 8, "dissimilar": 4, "dist": [4, 24], "distanc": [0, 24, 26, 28], "distil": 0, "distinct": [12, 13, 27, 28], "distinguish": [0, 4, 7, 8, 17, 18, 28], "distribut": [2, 5, 8, 11, 12, 17, 19, 21, 22], "dit": 11, "ditto": [0, 2, 15, 18], "div": 10, "diverg": 12, "divers": [0, 15, 18, 23, 25], "dixon": 0, "djgd21": [], "djp": [0, 13, 18], "dkb14": [], "dl": 8, "dljn24": [0, 28], "dmitri": [0, 5, 6, 8, 23], "dmitrii": [], "dml": [0, 5, 8, 9], "dmp18": 
[], "dmp19": [0, 17], "dn21": [0, 11], "do": [0, 2, 6, 8, 11, 15, 19, 21, 22], "do_sampl": 4, "doc": 10, "docker": [], "docker_pycr": [], "dockhorn": [], "docnam": [], "docstr": [], "docstring_pars": [], "doctor": 15, "document": [0, 4, 9, 24], "doe": [6, 9, 10, 11, 12, 19, 21, 24], "doesn": [2, 11, 19, 23], "doh": [0, 5, 8, 15, 18, 23, 25, 28], "doi": [0, 5, 6, 8, 9], "domain": [0, 2, 3, 4, 6, 8, 11, 12, 13, 14, 15, 16, 17, 21, 28], "domin": 4, "dominik": 0, "don": [3, 8, 10, 19, 21, 23, 27], "donahu": [0, 17, 18], "donald": [], "done": [6, 10, 11], "dong": [0, 5, 8, 9], "dongchao": [], "dongdong": [], "dongjun": 0, "dongt": [0, 8], "dorien": [0, 5], "doshi": [], "dot": [9, 12, 28], "dougla": [0, 9, 17, 18, 27], "down": [11, 28], "downbeat": [], "download": [10, 24], "downsampl": 11, "downstream": [15, 16], "dpm": [], "dpmpp": 10, "dpo": 19, "dramat": [25, 28], "draw": 12, "drawback": 9, "dreambooth": [], "dreamfus": [], "drift": 11, "drive": 13, "driven": [0, 8, 17, 18], "drop": [0, 28], "drop_last": [4, 24], "dropout": 28, "drum": 2, "dsdb16": [], "dtype": [4, 24], "du": [0, 18], "duan": 0, "dubei": [], "dubnov": [0, 8, 9, 18, 28], "duc": 0, "due": [13, 20, 21, 25], "duet": 24, "duh": [0, 5, 8, 9], "dumoulin": [], "dung": [], "durand": [0, 8, 9, 18], "durat": [10, 13], "dure": [12, 15, 18, 19, 27, 28], "dvdos18": [], "dwcn23": [0, 16, 18, 28], "dylan": [], "dynabert": [], "dynam": [0, 13, 19, 24, 28], "d\u00e9fossez": 0, "e": [0, 2, 4, 5, 8, 9, 11, 16, 19, 21, 22, 25, 26, 27, 28], "e0": [], "e07ce413d16ef64e885bea37551eac4c5ca3ddd440933f9c94594273d0d9": [], "e0d3c824784ff121c03cc031f944bc7e139a8f1870ffd2845cc2dd76f6c4": [], "e1127810de8b60a58bfa682f858fd7ba36667d29c0b9ad3b6ff10d6cb944": [], "e1956f7ca582a22dd1f17b9e26fcb8229051b0ce6d33b47227824772feec": [], "e2": [], "e3": [], "e4": [], "e5": [], "e7": [], "e8": [], "e8c04e80e82391a6e51f218ca49720f64236bc824e92152a2633b74cf7ab": [], "e9": [], "e9fcff7623954d86bdc17782036cbf715ecab1bec4847c008557affe1ca8": [], 
"e_": 8, "ea": [], "each": [6, 8, 9, 11, 12, 17, 19, 21, 23, 25, 26, 27, 28], "ead346e904390a53e71b5da2df7e7839abb16e967ba07fa15addf1f9f37c": [], "earli": [8, 13, 19, 28], "earlier": [8, 19, 26], "earliest": 8, "easi": 8, "easier": [7, 17, 19], "easili": 17, "easy_gener": 10, "eb": [], "ebnj33fcrl": 0, "ec": [], "ecal": [5, 28], "eck": [0, 17], "econom": 20, "economi": 20, "ect": [0, 11, 13], "ed": [], "edg": 18, "edgar": [], "edict": [], "ediff": [], "edit": [0, 2, 5], "editor": [0, 5, 6, 8, 9], "edmsound": 0, "educ": [7, 15], "edward": [], "edwin": [], "ee": [], "ee39c6e92acc742c052f137b47c210cd0a1b72dcd3f98495528bb4d27761": [], "eerili": 11, "eess": [0, 6, 8], "effect": [0, 5, 12, 16, 17, 18, 19, 21, 22, 23, 25, 26, 27, 28], "effici": [0, 8, 9, 11, 13, 15, 28], "effort": [20, 25], "efro": [], "egregi": 20, "ehgr20": [0, 13], "ehohc": [], "ehsan": [], "eikan": [], "einop": 10, "einops_ext": [], "einsum": 24, "either": [6, 8, 9, 10, 11, 19, 28], "elabor": 12, "elbmg07": [0, 17], "electr": 24, "electrifi": [4, 24], "electron": [0, 5, 16, 28], "element": [9, 10, 24, 28], "elena": [0, 8, 9], "eleph": 2, "eleventh": 0, "eli": [], "elia": [], "elio": [0, 6, 8, 9, 15, 18, 28], "elizald": [0, 8], "ell": 11, "elli": [0, 18, 28], "ellison": [], "eloi": [], "els": [4, 10, 16, 24], "elsen": [], "elucid": [], "ema": [], "ema_pytorch": [], "emanuel": 0, "emb": [4, 11], "embed": [0, 2, 3, 4, 6, 8, 11, 12, 14, 16, 18, 19, 21, 23, 24, 25, 26], "embedding_cat": 4, "embedding_prefix": 4, "embedding_text": 4, "embeddings_2d": 16, "emed": 28, "emerg": [6, 8, 9, 13, 15, 19, 25], "emili": [], "emilian": 0, "emir": [0, 8, 9], "emmanouil": [0, 5, 6, 8, 9, 15, 18, 23, 28], "emmanouilid": [], "emnlp": [0, 8, 9], "emot": [0, 4, 9, 13, 20, 24], "emphas": [18, 25, 26], "emphasi": [15, 17], "empir": [0, 8, 9, 19], "emploi": [8, 14, 20, 21], "emr": [], "emu": [], "en": 10, "enabl": [4, 5, 7, 8, 13, 15, 16, 18, 19, 23, 24, 25, 27, 28], "enchant": 4, "encod": [3, 4, 11, 13, 14, 17, 18, 24, 28], 
"encodec": [11, 14, 19], "encompass": [9, 23], "encount": 28, "encourag": [11, 15, 19, 28], "end": [0, 5, 6, 11, 17, 21, 24], "endeavour": 6, "energet": [24, 28], "energi": 20, "enforc": 21, "engag": 25, "engel": [0, 5, 17, 18], "engin": [15, 17, 20], "english": [5, 19, 20, 21], "enhanc": [0, 3, 5, 8, 9, 13, 18, 19, 28], "enjoi": 9, "enorm": [], "enough": [4, 19], "ensembl": [], "ensur": [4, 12, 26, 28], "enter": 17, "entir": [8, 9, 11, 12, 19, 23, 28], "entiti": 28, "entropi": [12, 14, 28], "enumer": 16, "env": 10, "envinro": 10, "environ": [10, 15], "eos_token_id": 4, "eot": 21, "ep": [], "epc": [0, 2, 11], "epoch": [4, 24], "epoch_loss": [4, 24], "epstein": [], "epur": [0, 8, 9], "equal": [4, 6], "equat": [0, 11], "equilibrium": 0, "equit": 20, "equival": 19, "er": 20, "era": [9, 15], "eri75": [], "eric": [], "erich": [], "erickson": [], "erik": [0, 9], "ermon": 0, "err": [0, 13, 17], "error": 19, "escap": 5, "escriv": [], "esl": 0, "especi": [9, 11, 19, 20, 21], "essenti": [11, 12, 18, 19], "esser": [], "establish": [8, 9, 15, 26], "estim": [0, 27], "et": [4, 8, 9, 24, 25], "eta": [], "etc": [11, 16, 21, 27], "ethan": 0, "euclidean": 26, "eugen": [], "eunggu": [], "evad": [4, 24], "eval": [4, 24], "evalu": [0, 3, 5, 7, 8, 9, 13, 15, 18, 19, 23, 24, 28], "evan": 0, "even": [2, 4, 6, 9, 12, 19, 28], "event": [9, 23], "ever": 24, "everi": [8, 19, 21], "evgeni": [], "evolut": [3, 7, 9, 18, 23, 25], "evolv": [13, 17, 23, 28], "exact": [6, 8, 11], "exactli": [21, 24], "exam": 21, "examin": [3, 18, 28], "exampl": [2, 3, 4, 5, 6, 7, 8, 9, 13, 14, 15, 16, 19, 21, 23, 25, 27, 28], "excel": [12, 16], "except": 21, "excit": [2, 3, 16, 24], "exclus": 23, "execut": 19, "exemplifi": [13, 20], "exercis": [2, 18], "exhibit": 20, "exist": [2, 5, 7, 8, 11, 16, 17, 19, 28], "exit": [], "exp": [8, 24, 28], "expand": [4, 5, 13, 27, 28], "expect": 9, "experi": [0, 4, 9, 23, 24, 25, 28], "experiment": [0, 28], "expert": [], "expertis": 15, "explain": [7, 15, 27], "explicit": 23, 
"explicitli": 10, "exploit": [], "explor": [0, 3, 9, 13, 15, 18, 23, 24, 25, 28], "export": 10, "expos": 28, "express": [0, 2, 9, 13, 16, 23, 25, 28], "expressivenss": 13, "ext": [], "extend": [0, 8, 24], "extens": [19, 21], "extern": [0, 19], "extra": 11, "extract": [2, 4, 11, 24, 28], "extractor": 8, "extrem": 21, "f": [0, 4, 5, 6, 9, 10, 11, 24, 28], "f0": [], "f0b9ad6c0a9017e62d4735daaeb11ba3b6c009d69a26141b258cd37b5588": [], "f185bfd0ca1d213beb4293bed51d92254df23d8ceaf6c0e17146d508a776": [], "f2": [], "f2b75d2fc6f1a260f340f0e7c6a060f4dd2961cc16884ed851b0d18da06a": [], "f4": [], "f5": [], "f6": [], "f6bd1eee09314e7e6dee49cbe2c5e22314ccdb38db16c9fc72d2fa80d054": [], "f7": [], "f7e21b113dd48a9c97d364e0915b3988c6a0b6207652f5a92372871b7aa4": [], "f9": [], "f9d7fe80a8fcce9bb128d1381c6fe41a8d286d7e18395e273002e8e0fa34": [], "f_": [8, 11], "fa": [], "fabien": [0, 28], "face": [7, 13, 24, 25, 27], "facilit": 15, "fact": [8, 11], "factor": [0, 26], "fadernet": [], "fail": [2, 6, 19, 23], "failur": 2, "fair": [20, 26], "fall": [8, 27], "fals": [4, 5, 10, 24, 26], "familiar": [4, 24], "fan": 4, "fandong": [], "fang": [], "fantast": 11, "far": [2, 4], "farid": [], "fashion": [9, 23], "fast": [0, 5], "fastapi": [], "fastcor": [], "faster": 19, "fatigu": 12, "favor": 2, "favour": 8, "fazeka": [0, 6, 8, 9, 15, 18, 28], "fb": [], "fc": [], "fd": [], "feasibl": 4, "featur": [0, 2, 4, 5, 7, 8, 9, 10, 11, 14, 17, 19, 21, 23, 24, 27, 28], "fed": 21, "federico": [], "fedu": [0, 18], "feed": 19, "feedback": [0, 3, 12, 18, 23, 25], "feel": 5, "felix": [0, 18], "femal": [4, 24, 28], "feng": [], "ferjad": [], "fernando": [0, 25], "few": [0, 2, 4, 6, 8, 10, 11, 19, 26], "fewer": 26, "ff": [], "ff642e65ad6b90db43e668d70ffb6736436c7ce41fcc549f4e9472234127": [], "ffbf7a134b9ab11a67b0cf0726453cedd9c5043a4fe7a35d1cefa9a1bcfb": [], "ffmpy": [], "fid": [], "fidel": [0, 20], "fidler": [], "field": [3, 13, 15, 19, 20, 22, 28], "figsiz": 16, "figur": [13, 14, 16, 21], "file": [10, 24], 
"filelock": [], "filip": [0, 18, 25], "filippo": [], "fill": [19, 21, 22], "film": [7, 11, 21], "filter": [17, 27, 28], "filterwarn": [10, 16], "final": [4, 6, 7, 9, 11, 12, 13, 15, 19, 27], "find": [0, 4, 5, 7, 8, 9, 17, 19, 20, 23, 24, 26, 27], "fine": [0, 2, 5, 8, 9, 11, 19, 28], "finetun": [0, 8, 11, 18], "finit": 21, "finnicki": 10, "fire": [], "firmli": 15, "first": [4, 6, 8, 9, 10, 11, 12, 13, 15, 17, 19, 21, 23, 24, 26, 27, 28], "fisch": [], "fischer": [], "fit": [2, 8, 11], "fit_transform": 16, "fix": [8, 9, 11, 17, 19, 21, 28], "fjeld": [], "flamingo": [0, 8, 19], "flash": [], "flashattent": [], "flat": 12, "flatten": 24, "flatten_dict": [], "flavio": [], "fleet": [], "flexibl": [3, 9, 15, 16, 19, 21, 22, 27, 28], "flexibli": 19, "float": 11, "float32": 10, "flore": 0, "florencia": [0, 18], "flori": 0, "florian": [], "flow": 0, "fltz10": [0, 17], "fm": 24, "fm22": [0, 11], "fma": 8, "fn": 26, "focu": [2, 6, 9, 11, 13, 15, 17, 23, 25, 28], "focus": [3, 6, 9, 13, 15, 16, 17, 18, 21, 26, 28], "folk": 24, "follow": [0, 2, 3, 4, 7, 8, 10, 11, 12, 13, 14, 17, 18, 19, 21, 26, 28], "fontsiz": 16, "fonttool": [], "foot": 22, "forc": 23, "foreign": [4, 24], "forget": 21, "forgo": 8, "fork": 10, "form": [2, 5, 6, 7, 8, 9, 11, 16, 19, 27], "formal": [2, 11, 23], "format": [4, 6, 19, 22, 28], "former": 8, "formul": [7, 18, 23, 28], "forsgren": 0, "forth": [23, 24], "forum": [0, 8], "forward": [4, 11, 13, 14, 24], "fossez": [0, 18], "foster": 15, "found": [8, 19], "foundat": [0, 4, 8, 9, 15, 16, 18, 24, 28], "four": 26, "fourier": 0, "fp": 26, "fr": 0, "frac": [8, 26, 28], "fragkiadaki": [], "frame": 4, "framework": [3, 4, 8, 15, 18, 19, 22, 27, 28], "fran": 0, "franci": [], "francisco": 15, "francoi": 0, "frank": 0, "frechet": [], "freder": [], "fredo": [], "free": [0, 2, 4, 10, 24], "freedman": [], "freedom": [], "freeman": 0, "freeu": [], "freez": [4, 24], "freeze_backbone_model": 4, "freeze_parma": [4, 24], "french": 19, "fresh": 23, "fri": [], "from": [0, 2, 3, 4, 
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 23, 24, 25, 26, 27, 28], "from_pretrain": [4, 24], "frontier": 19, "frozen": [8, 19], "frozenlist": [], "fsspec": [], "ftfy": [], "fu": [0, 8, 17], "full": [2, 9, 13, 19, 23, 24], "fulli": [2, 5, 8, 10, 11, 20, 23], "function": [2, 4, 5, 8, 11, 13, 14, 24, 26], "functool": 10, "fundament": [17, 25, 26, 28], "furkan": 0, "further": [8, 9, 11, 13, 15, 18, 19, 22, 28], "furthermor": [16, 18, 28], "fuse": [], "fusion": [0, 28], "futga": [0, 5, 8, 9], "futur": [3, 15, 17, 20, 21, 23, 28], "futurewarn": [4, 10, 24], "g": [0, 2, 4, 5, 8, 9, 10, 11, 19, 22, 25, 26, 27, 28], "g_": 11, "ga": 0, "gaa": [], "gabbolini": [0, 8, 9], "gabriel": [0, 18, 28], "gain": 28, "gal": [], "game": [], "gamma_": 11, "gamper": [], "gan": [0, 11, 19, 21], "gang": [0, 8], "ganguli": [], "ganti": [0, 18, 25, 28], "gao": [0, 9], "gap": [3, 23, 27, 28], "garcia": 0, "gardner": [0, 8, 9, 18], "gareth": 0, "gat": [0, 18], "gate": [8, 11], "gaussian": 11, "gayoung": [], "gdsb23": [0, 8, 9, 16, 18], "ge": [0, 5, 8, 9], "geeta": [], "gef": [], "gemmek": [], "gen": [], "gener": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 20, 21, 22, 23, 25, 28], "generalis": 9, "generate_diffusion_cond": 10, "generated_text": 4, "genr": [0, 2, 5, 9, 13, 16, 17, 23, 27, 28], "geoffroi": [0, 9], "geometr": [6, 28], "geon": [], "georg": [0, 6, 8, 15], "geq": [], "gerhard": [], "germain": 0, "german": 19, "gert": [0, 17, 18, 27], "gestin": [], "get": [6, 10, 11, 19], "get_device_nam": [4, 24], "get_ipython": 4, "get_item_vector_db": 24, "get_pretrained_model": 10, "get_query_embed": 24, "geyu": [0, 6], "ggbe24": [], "ggdre22": [0, 13], "ggre21": [], "ghanem": [], "gharbi": [], "ghasemipour": [], "ghe22": [0, 8, 9], "gherardi": [], "ghosal": [0, 5], "gil": [], "gimelshein": [], "gin": [], "gin_config": [], "ginneken": [], "ginsburg": [], "giorgi": 0, "giorgio": 0, "giovanni": [0, 8, 9], "girish": [0, 28], "girl": 24, "git": 15, "git18": [], "gitdb": [], 
"github": [4, 10, 15, 24], "gitpython": [], "give": [5, 8, 9, 11, 19], "given": [2, 6, 8, 9, 11, 12, 13, 17, 19, 21, 22, 26, 27], "gl83": [0, 12], "glasner": [], "glass": [0, 8], "gll": [0, 8], "global": [9, 11, 28], "glove": 28, "gltq23": [0, 9], "gmmp23": [], "go": [2, 9, 11, 13, 18, 19, 22, 24], "goal": [3, 6, 7, 9, 11, 19, 28], "goe": 19, "goel": [0, 8], "goh": [0, 28], "gokul": [], "golai": [], "gold": 6, "goldberg": [0, 8, 9], "golden": [], "gome": [], "gomez": [0, 5, 8, 9], "gone": 7, "gong": [0, 8], "gongfan": [], "gontijo": [], "good": [2, 11, 12, 19], "goodfellow": 0, "googl": [4, 13, 15, 24], "gordon": 0, "got": [4, 10, 24], "gouyon": [0, 28], "goyal": [], "gpt": [0, 4, 5, 8, 15, 18, 19, 22, 28], "gpt2": 4, "gpt2lmheadmodel": 4, "gpt2token": 4, "gpu": [4, 24], "grachten": 0, "gradient": [0, 2, 11, 22], "gradio": [], "gradio_cli": [], "gradual": 11, "grai": 21, "grain": [0, 2, 5, 8, 9, 28], "gram": [6, 22], "grandios": 4, "granular": 27, "graph": [6, 19, 28], "grave": [0, 17], "great": [2, 4, 11, 20], "greater": [13, 28], "greatest": 16, "green": [0, 13, 16, 17, 21], "greenwood": 0, "greg": [], "gregori": [], "grew": 17, "griffin": 0, "gritsenko": [], "groh": [], "grosch": [0, 28], "gross": [], "ground": [4, 6, 24, 26], "groundwork": 17, "grounth": 6, "group": 0, "grow": [2, 25, 27], "grown": 13, "grpcio": [], "grug17": [], "gschwind": [], "gskp23": [0, 2, 13], "gt": 4, "gu": 0, "guanglu": [], "guangzhi": [0, 8], "guestrin": [], "gui": [0, 8], "guid": [0, 2, 5, 8, 13, 23, 25], "guidanc": [0, 2, 10, 15], "guitar": [2, 5, 10, 16, 24], "gulrajani": [0, 17], "gunjan": [], "guo": [0, 5, 8, 9], "guojun": [0, 17], "gupta": [], "gupta2023photorealisticvg": [], "guu": [0, 18], "gy": [0, 8, 9, 18, 28], "gy\u00f6rgi": [0, 9], "gz": [], "h": [0, 8, 11], "h11": [], "h5py": [], "h_audio": 24, "h_text": 24, "ha": [0, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19, 23, 27, 28], "had": [11, 28], "hadjer": 0, "hai": [], "hall": 0, "hallaci": [0, 28], "hallucin": [19, 20], 
"han": [0, 8], "hand": [7, 9, 19], "handl": [11, 16, 17, 19, 23, 25], "hang": [0, 8], "hani": [], "hann": [], "hantrakul": 0, "hao": [0, 5, 8, 9], "haoh": [0, 18], "haoran": [], "haoxin": [], "haoyi": [0, 28], "happen": [4, 19], "happi": [5, 16, 27, 28], "hard": [2, 4, 7, 11, 19, 28], "harder": 9, "harm": [], "harmon": [2, 6], "hat": [8, 11], "have": [2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19, 20, 21, 22, 23, 24, 27, 28], "hawlei": 0, "hawthorn": 0, "hayk": 0, "hce": [], "he": [0, 8, 9, 15], "head": [19, 28], "hear": [0, 7, 8], "heard": 5, "heart": 24, "heavi": [5, 28], "heewoo": [0, 18], "heiga": [0, 17], "height": 11, "helen": [], "helena": [0, 5, 8, 9], "hellsten": [], "help": [0, 3, 6, 12, 17, 19, 26, 28], "hennequin": [0, 8, 9], "her": [4, 15, 24], "herald": 15, "herbert": [], "herd": [], "here": [2, 7, 10, 11, 21, 24], "herman": [], "herreman": [0, 5], "herrera": [0, 9], "hershei": [], "hertz": [], "hertzmann": [], "hesit": 3, "heusel": 0, "hf_token": 10, "hh": [], "hhl": [0, 8, 9], "hhy": [], "hi": 15, "hi79": [0, 13], "hidden": [8, 11, 21], "hierarch": 0, "high": [0, 2, 9, 11, 12, 16, 17, 19, 27, 28], "higher": [6, 7, 11, 13, 26], "highest": 26, "highli": 12, "highlight": [2, 4, 6, 7, 10, 15, 23, 25], "hila": 0, "hiller": 0, "hilton": [], "hing": 28, "hint": 19, "hiromi": [0, 6, 8], "hirsh": [], "histor": [21, 23], "histori": [9, 22, 23, 25], "hit": 11, "hja20": [0, 13], "hjc": [], "hjl": [0, 16, 18, 28], "hla": [], "hlss23": [0, 5, 8], "hmt": [], "ho": 0, "hochreit": 0, "hoffman": 0, "holger": [0, 28], "holist": [0, 19], "holoview": [], "holynski": [], "hongsheng": [], "hongyin": [0, 8], "hook": 24, "hope": 3, "horac": [], "hot": [16, 17, 18, 19], "hotel": 28, "hou": [0, 18], "how": [0, 2, 3, 4, 6, 7, 8, 9, 11, 15, 17, 18, 19, 21, 22, 23, 25, 26, 28], "howev": [8, 9, 11, 12, 15, 17, 19, 21, 23, 25, 27, 28], "hpn17": [0, 13], "hpw": [], "hru": [0, 12], "hs21": [], "hsg": [], "hsiang": [0, 6, 8], "hsiao": 0, "hsin": 0, "hsr": [], "hsuan": [0, 17, 18], 
"ht": 28, "html": 5, "http": [0, 4, 5, 6, 8, 9, 10, 15, 24], "httpcore": [], "httpx": [], "hu": [], "huam": [0, 8], "huang": [0, 5, 8, 9, 18, 28], "hub": [], "hubert": 0, "hug": 24, "huge": [16, 19, 20, 21, 22], "huggingfac": [4, 10, 24], "huggingface_hub": 10, "hugo": 0, "hui": [0, 28], "huiwen": 0, "hum": 5, "human": [0, 6, 7, 9, 13, 15, 17, 18, 20, 23, 25, 27, 28], "humphrei": [], "hundr": 19, "hussain": [0, 5, 8], "hvu": [0, 13], "hy20": [0, 13], "hybrid": 8, "hyelin": [], "hyper": 2, "hyperparamet": 28, "hyung": [0, 18], "hyungjin": [], "hzrs16": [], "i": [0, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], "ian": 0, "icassp": [0, 5, 8, 9, 15, 18, 28], "icassp48485": [0, 5, 8], "iccv": 0, "iclr": [0, 17], "icml": [0, 8, 15, 18], "id": [0, 5, 8], "idea": [11, 16, 19, 28], "ideal": 19, "ident": 11, "identifi": [12, 19, 26, 28], "idf": 6, "idna": [], "ieee": [0, 5, 8, 9, 15, 17, 18, 27, 28], "ieeexplor": [0, 9], "iffus": [], "ignor": [10, 16, 23], "ignore_index": 4, "ijcai": [0, 8], "ijcnn": [0, 8, 9, 18], "ijcnn54540": [0, 9], "ikemiya": 0, "il": [], "ilaria": [0, 3, 5, 6, 8, 9, 15, 18, 23, 28], "ilg": [0, 18], "illia": 0, "illustr": [13, 14, 18, 19], "ilya": [0, 18], "imag": [0, 6, 8, 11, 12, 16, 17, 19, 20], "imagegpt": 19, "imageio": [], "imagin": [2, 11, 24], "imbu": 11, "immers": [], "impact": [8, 16, 20], "imperi": 15, "implement": [8, 20, 22, 24, 28], "implicit": 23, "implicitli": 11, "import": [3, 4, 5, 6, 10, 11, 16, 19, 23, 24, 25, 26, 28], "importlib": [], "importlib_resourc": [], "impract": 27, "impress": [13, 24], "improv": [0, 4, 13, 19, 23, 24, 25, 26, 28], "inabl": [2, 23, 25], "inaccur": 19, "inbar": [], "inc": 0, "includ": [4, 6, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 22, 26], "inclus": [12, 15], "incorpor": [4, 9, 13, 14, 18, 19, 23, 25, 28], "incorrect": 19, "incorrectli": 26, "increas": [19, 20, 27, 28], "increasingli": [3, 7], "incredibli": 24, "independ": [23, 25], "index": [4, 11, 24], "indi": 5, 
"indian": [4, 24], "indic": [12, 19, 21, 24, 26, 27], "individu": 28, "indulg": 15, "infer": [0, 2, 10, 16, 18, 19, 23, 27], "infinit": [], "influenc": [2, 8, 18, 24, 28], "influenti": [21, 22], "info": [], "infonc": 28, "inform": [0, 2, 4, 6, 8, 9, 13, 15, 16, 17, 18, 19, 21, 22, 23, 25, 26, 27], "informat": [4, 24], "infus": [4, 24], "inher": 23, "init": 2, "init_temperatur": 24, "initi": [4, 8, 11, 13, 17, 23, 27], "initialis": 8, "inlin": [], "innov": [13, 15, 20, 28], "inpaint": 2, "input": [2, 4, 8, 9, 11, 12, 13, 14, 17, 21, 22, 28], "input_id": [4, 24], "inputs_emb": 4, "insid": [11, 21], "insight": 12, "inspir": [4, 13], "instal": [4, 10, 15], "instanc": [21, 28], "instead": [5, 6, 7, 8, 9, 11, 19, 21, 23], "institut": [0, 17, 20, 27], "instruct": [0, 5, 6, 8, 9, 18], "instruct_2023": [], "instructgpt": 19, "instructpix2pix": [], "instrument": [0, 2, 5, 9, 13, 16, 17, 23, 27, 28], "int": [4, 24], "int16": 10, "integ": 21, "integr": [11, 12, 13, 18, 19, 25], "intellig": [0, 8, 15, 19, 28], "intend": [13, 22], "intens": 2, "intent": [0, 16, 19, 23, 25], "interact": [9, 13, 15, 19, 20, 23, 25, 28], "interest": [2, 7, 15, 17, 22, 23, 24, 25], "interestingli": 19, "interfac": [8, 15, 25], "interleav": [11, 19], "intermedi": [0, 8], "intern": [0, 5, 6, 8, 9, 11, 15, 17, 18, 19, 20, 25, 28], "internationaltunion01": [], "internet": [20, 28], "interpret": [8, 19], "interspeech": 0, "interv": [], "intervent": 0, "intric": 4, "intro": 11, "introduc": [3, 11, 13, 14, 18, 19, 21, 25, 28], "introduct": 18, "intuit": [2, 16, 19, 25, 27], "invalu": 12, "invers": [0, 2], "invert": [], "invertib": 2, "investig": [3, 18], "invok": 19, "involv": [10, 11, 17, 19, 26], "io": [], "ipd": 10, "ipython": [4, 5, 10, 24], "iqbal": 0, "irani": [], "iren": 0, "irrelev": [26, 28], "is_avail": [4, 10, 24], "isaacson": 0, "isca": 0, "ish": 4, "ishaan": [0, 17], "ismir": [0, 6, 8, 9, 11, 15, 17, 18, 28], "ismir2008": 17, "ismir2019": 17, "ismir2021": 17, "isn": [5, 6, 19, 23], "isnn": [0, 
8], "isola": [], "isotrop": 11, "issn": [0, 9], "issu": [4, 6, 10, 23, 24], "itai": [0, 18], "item": [0, 4, 5, 8, 9, 16, 18, 24, 25, 26], "item_joint_embed": 24, "item_vector_db": 24, "iter": [2, 8, 23], "its": [2, 4, 7, 8, 9, 12, 14, 15, 16, 17, 19, 20, 22, 23, 27, 28], "itself": [2, 5, 9, 11, 19], "itu": 0, "izze17": [], "j": [0, 18, 28], "jaakko": [], "jaakkola": [], "jack": [0, 28], "jacob": [0, 18, 28], "jacquelin": [0, 9], "jade": [0, 18], "jae": 0, "jaegl": [], "jaejun": [], "jaesik": [], "jaewoong": [], "jai": [], "jain": [0, 17], "jakob": 0, "jame": [0, 8], "jamendo": [5, 8], "jampani": [], "jan": [0, 5], "janko": [0, 18], "jann": 0, "janner": [], "jansen": [0, 5, 18, 28], "jargon": 19, "jasa": [], "jascha": 0, "jasco": 2, "jason": [0, 18], "jauhri": [], "javier": 0, "jayasumana": [], "jazz": [27, 28], "je": [], "jedi": [], "jeffrei": [0, 9, 18], "jen": [], "jeong": [0, 18, 28], "jeongsol": [], "jess": [0, 5, 17, 18], "jessica": [], "ji": [], "jiacheng": [], "jiaheng": [0, 5, 8, 9], "jiahui": 0, "jiaji": [], "jiajia": [], "jiam": [], "jian": [], "jianbin": [], "jianfei": [], "jiang": [0, 6, 8, 18], "jianglong": [], "jianlin": [], "jianmin": [], "jianxin": [0, 28], "jiasheng": [0, 8], "jiawei": [], "jiayi": [], "jie": [0, 8], "jimmi": [], "jin": [0, 6, 8, 23], "jinan": [], "jinbo": [], "jing": [], "jinglin": [], "jingren": [0, 6, 8], "jingwei": 0, "jinja2": [], "jinwoo": [0, 6], "jiong": [], "jitong": 0, "jiwen": [], "jiyoung": [0, 18, 28], "jmespath": [], "jnmr": [0, 9], "joao": [], "joar": [], "job": 20, "joblib": [], "john": 0, "join": [4, 8, 24], "joint": [0, 2, 3, 8, 9, 18, 23, 25], "joint_dim": 24, "jointembeddingmodel": 24, "jointli": 28, "jona": [], "jonah": 0, "jonathan": 0, "jone": 0, "jong": [0, 15, 18, 28], "jongmin": [], "jongpil": [0, 8, 9, 17, 18, 28], "jongwook": 3, "joon": [], "joonseok": [0, 18, 28], "jooyoung": [], "jordi": 0, "jort": [], "jose": [0, 17], "josef": [0, 5], "joseph": [], "josh": [0, 8, 9, 18], "joshua": [], "josiah": 0, 
"journal": [0, 9, 17, 18, 23], "journei": 17, "joy": 5, "jrv": [], "jsonmerg": [], "jsonschema": [], "ju": 0, "juan": 0, "judg": [], "judgement": 6, "judith": [0, 18, 28], "juhan": [0, 8, 9, 15, 17, 18, 23, 25, 28], "juho": [], "jukebox": [0, 13, 15, 18], "jukedrumm": [], "julian": [0, 5, 8, 9, 15, 17, 18], "julio": [], "juliu": [], "jump": 11, "jun": [0, 8, 18], "junbo": 0, "junda": [0, 5, 8, 9], "june": [0, 5, 8, 9], "junghyun": 0, "junho": [], "junqi": [0, 8], "junyan": 0, "jupyt": 15, "just": [10, 11, 12, 19, 20, 21, 23, 24, 25, 27], "justin": [0, 5, 28], "k": [8, 9, 14, 24, 26], "k_diffus": [], "kaal22": [], "kadian": [], "kai": [0, 17], "kaim": [], "kaiser": 0, "kaist": 15, "kakao": 15, "kal": [], "kalambarkar": [], "kalchbrenn": [0, 17], "kamko": [], "kamyar": [], "kang": [], "kant": [0, 18], "kao": [], "karagol": [], "karan": 0, "karen": [0, 17], "karlinski": [0, 8], "karra": [], "karsten": [], "karunratanakul": [], "kastner": [], "katarina": [0, 18], "kate": [0, 8], "katerina": 0, "katharopoulo": 0, "katherin": [0, 18], "kavukcuoglu": [0, 17], "kawar": [], "kazuhito": [], "kb": [], "kb14": [], "kbockw15": [], "kci": [0, 12], "ke": [0, 3, 5, 8, 15, 18, 19, 23, 28], "keep": 24, "keepdim": 16, "kei": [3, 7, 8, 9, 10, 12, 13, 14, 17, 18, 23, 26, 27, 28], "keji": [], "kelvin": [0, 18], "kenton": [0, 18, 28], "kept": 8, "keqiang": [], "keren": [], "keunwoo": [0, 8, 9, 17, 18, 23, 25, 28], "kevin": [0, 5, 8, 9], "kexin": [], "keyword": [0, 13, 28], "kfir": [], "kgb": [0, 8], "kharitonov": [], "khurana": 0, "khz": 11, "ki": [], "kilgour": 0, "kilian": [], "kim": [0, 9, 15, 18, 23, 25, 28], "kind": [5, 6, 9, 19, 21], "kingma": 0, "kirchhoff": [0, 28], "kirel": [], "kirkpatrick": [0, 8, 15, 18, 28], "kirsch": [], "kiwisolv": [], "kjz24": [], "kkdb": [], "kkkm23": [], "kl": [11, 12], "klaski": [], "kll": [0, 11], "knife": [0, 6, 8], "knob": 10, "knolwedg": 4, "know": [9, 21], "knowledg": [0, 4, 8, 9, 16, 19, 20, 21, 24, 28], "known": [10, 27, 28], "koh": [], 
"kohler": [], "koichi": [], "koishida": [], "kong": [0, 8], "konpat": [], "koo": 0, "korai": [0, 17], "kornia": [], "kornia_r": [], "korraw": [], "korzeniowski": [], "kosta": 0, "kostrikov": [], "kozareva": [0, 8, 9], "kpa": [], "kph": [], "kpschonfeld": [], "krasheninnikov": [], "kreb": [], "krei": [], "kreuk": [0, 18], "kristina": [0, 8, 9, 18, 28], "krisztian": [0, 18, 25], "krueger": [], "kshiteej": [], "ksl": [0, 11], "ksm": [0, 9], "ksp": [0, 13], "ku": [0, 6, 8], "kullback": 12, "kumar": [0, 17], "kundan": [0, 17], "kuznetsov": 0, "kw13": [], "kwak": [], "kwan": [], "kwg": [0, 2], "kwon": [0, 23, 25], "kyle": [], "kynk": [], "kynkaanniemiak": [], "kyogu": [0, 6], "kyunghyun": [0, 9], "kzb": [], "kzl": [], "kzrs18": [], "kzrs19": [0, 12], "kzz": [], "l": [6, 8, 9, 28], "l1": 14, "l177": 10, "l2": 14, "lab": 15, "label": [0, 4, 7, 9, 12, 17, 19, 21, 24, 26, 27], "lack": [2, 11, 19, 23, 28], "lai": 0, "laid": 17, "lain": [], "laion": [], "laion_clap": [], "lala": [], "lam": [], "lam08": [0, 17], "lama": [0, 18], "lamer": [0, 17], "lamtharn": 0, "lanckriet": [0, 17, 18, 27], "land": [], "lang": 0, "langaug": [24, 28], "languag": [0, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 16, 17, 18, 20, 23, 24, 25, 27, 28], "lanzendarferppw": [], "lanzendarferppw24": [], "lanzendorf": [0, 8], "lanzendorfer_blap_nod": [], "lanzend\u00e3": [], "larg": [0, 4, 5, 6, 7, 8, 11, 18, 19, 20, 23, 24, 25, 26, 27, 28], "larger": [4, 19, 24], "larson": [0, 8], "last": [4, 7, 10, 16, 19, 24], "last_hidden_st": 24, "lastli": [20, 21], "late": [0, 6, 8], "latenc": [19, 20], "latent": [0, 2, 8, 11, 13, 14, 24, 28], "later": [8, 11, 13, 19, 22], "latest": [7, 15, 22], "latin": 20, "latter": 9, "lattner": 0, "lau": [], "launch": 15, "laura": [], "laurent": [], "laurier": [0, 17], "lav": [], "lawrenc": [], "layer": [0, 8, 11, 19, 21, 27, 28], "lazi": [], "lazo": [], "lazy_load": [], "lbd": [0, 6], "lcy": [0, 11, 13], "ldm": 0, "ldot": [8, 9, 26], "le": [0, 8, 18], "leach": [], "lead": [6, 12, 13, 19, 
23, 25, 27, 28], "learn": [0, 2, 3, 4, 7, 8, 9, 11, 12, 13, 15, 17, 18, 21, 22, 23, 25, 26, 27], "learnabl": [8, 28], "learner": [0, 4, 16, 18], "learnt": 4, "least": [6, 11, 26, 28], "leav": 23, "led": 13, "lee": [0, 6, 8, 9, 17, 18, 23, 28], "lee10": [0, 23], "leemput": [], "left": [8, 11, 19], "legend": 16, "legg": 0, "lehtinen": [], "lei": [], "leibler": 12, "lejaren": 0, "lejun": 0, "len": [4, 11, 16, 24], "leng": [0, 6], "length": [5, 6, 9, 10, 11, 19, 20], "lenient": 26, "leo": 0, "leonard": 0, "leoni": [0, 18], "leonid": [0, 8], "lerman": [0, 8, 9], "less": [12, 19, 20, 26], "lester": [0, 18], "leszczynski": [0, 18, 25], "let": [2, 6, 8, 9, 16, 19, 20, 21, 28], "letman": [], "letter": [0, 9, 19], "level": [0, 2, 6, 7, 8, 9, 11, 12, 16, 22, 27, 28], "leverag": [2, 3, 4, 8, 12, 13, 14, 16, 19, 25], "levi": 0, "levin": [], "lexic": 6, "lezcano": [], "lgw": [0, 2], "lhss24": [0, 5, 8], "li": [0, 8, 9, 12, 18, 28], "liang": 0, "liao": [0, 6, 8], "lib": [4, 10, 24], "librari": [], "librosa": 10, "licens": [4, 5], "lick": 5, "lifeng": [], "light": 8, "lightn": [], "lightning_util": [], "lightweight": 8, "like": [2, 5, 6, 8, 9, 10, 11, 13, 15, 16, 18, 19, 20, 21, 22, 23, 25, 26, 27, 28], "likelihood": 17, "lim": 0, "limit": [0, 3, 8, 9, 13, 16, 17, 18, 20, 21, 22, 27, 28], "lin": [0, 6, 9], "lina": [], "linalg": 16, "line": [4, 8, 10, 16, 19, 21, 24, 28], "line2d": 16, "linear": [4, 6, 8, 11, 24], "lingelbach": [], "linguist": [0, 5, 6, 8, 9], "link": [4, 8], "linkifi": [], "linmiao": [], "linux": 10, "linyang": [0, 8], "lior": [0, 17], "lipman": [], "list": [4, 11, 19, 24, 26], "listen": [0, 4, 5, 8, 10, 17, 24], "literatur": [6, 9, 19], "littl": 6, "liu": [0, 5, 6, 8, 9, 18, 24, 28], "liu19": [0, 28], "liu_music_2024": [], "live": [5, 21], "liwei": [], "lka": [], "lkopf": [], "ll": [2, 4, 8, 10, 11, 21, 22], "ll24": [0, 6], "llama": [0, 5, 8, 19], "llark": [0, 8, 9, 18], "ller": [], "llion": 0, "llm": [0, 3, 5, 18, 19, 20, 28], "llvmlite": [], "lm": [2, 11, 13, 
18], "lmn23": [], "lmnt": [], "lmz": [], "ln17": [0, 9], "load": [10, 24], "load_dataset": [4, 5, 24], "loader": [], "local": [0, 4, 10, 24], "local_attent": [], "localis": 9, "locat": 21, "lockhart": [], "log": [11, 19, 24, 28], "logit": [4, 17, 19, 24, 27], "logit_scal": 24, "loi": 0, "london": 15, "long": [0, 6, 8, 9, 11, 12, 17, 18, 19, 21, 22, 24, 27], "longbo": [], "longer": [6, 9, 11, 19], "longest": [4, 6, 24], "longpr": [0, 18], "look": [7, 9, 11, 18, 19, 21, 23, 26, 27], "looper": 2, "lope": [], "lorenz": [], "loss": [3, 4, 14, 19, 23, 24, 25], "loss_a2t": 24, "loss_t2a": 24, "lost": [], "lot": [11, 16, 19, 21], "loui": [], "low": [2, 12, 16, 27], "lower": [12, 19, 26], "lowest": 19, "lp": [0, 4, 5, 8, 18, 24], "lpg": [], "lppw24": [0, 8], "lr": [4, 24], "lsp": [0, 9], "lstm": 22, "ltgm19": [], "lth": [], "ltl": [], "ltl24": [0, 8], "ltu": 8, "lu": [0, 8, 9, 17], "luan": [0, 18], "luca": [0, 8], "luckili": 2, "lueb": 0, "luk": [], "lukasz": 0, "luke": [0, 17, 18, 27], "lun": [0, 6, 8, 18], "lunch": [], "luo": [0, 8], "luong": [], "lupe": [], "lv": [0, 6], "lvmin": [], "lxjz23": [], "ly": [], "lyl": [0, 11], "lyric": [9, 16, 23], "lyt": [], "lzb": [], "lzg": [0, 25], "m": [0, 8, 9, 15, 18], "m1": 10, "m2ugen": [0, 5, 8], "ma": [0, 5, 8, 9], "maarten": [0, 18], "mac": 10, "mach": 0, "machan": [], "machin": [0, 6, 7, 8, 13, 15, 17, 18, 19, 21, 28], "maciej": [], "macosx_10_10_x86_64": [], "macosx_10_12_x86_64": [], "macosx_10_13_x86_64": [], "macosx_10_15_universal2": [], "macosx_10_15_x86_64": [], "macosx_10_5_x86_64": [], "macosx_10_6_intel": [], "macosx_10_9_intel": [], "macosx_10_9_universal2": [], "macosx_10_9_x86_64": [], "macosx_11_0_arm64": [], "macosx_14_0_x86_64": [], "madmom": [], "maestro": [], "magazin": [0, 17, 18], "magenta": 13, "magnatagatun": [4, 5, 8, 24, 28], "magnatagtun": [], "magnitud": 8, "maher": [], "maheswaranathan": [], "mahieux": [0, 17], "mai": [2, 5, 6, 9, 12, 13, 19, 21, 23, 25], "main": [0, 2, 5, 7, 8, 9, 10, 11, 20, 21, 27], 
"maintain": [23, 25, 27], "major": [6, 20, 21, 23, 28], "majumd": [0, 5], "make": [2, 3, 4, 5, 8, 9, 10, 19, 20, 21, 23, 25, 27, 28], "male": 5, "malici": 20, "malinowski": [], "manag": [20, 27], "manco": [0, 5, 6, 8, 9, 15, 18, 23, 28], "mancusi": 0, "mandic": 0, "maneesh": [], "mani": [2, 5, 6, 8, 9, 11, 16, 20, 21, 22, 23, 24, 28], "manifold": 16, "manilow": 0, "mannies": [], "manoj": [], "manor": 0, "manual": 11, "mao": [0, 6, 8], "map": [2, 4, 8, 9, 11, 13, 19, 21, 26, 28], "marc": 0, "marcel": [], "marco": [0, 5, 18], "mard": 5, "margin": [11, 28], "mari": 15, "mariani": 0, "marianna": [0, 18], "marini": [], "mario": [], "mark": [0, 8, 9, 13, 21], "markdown": [], "markdown2": [], "marker": 16, "markerfacecolor": 16, "markers": 16, "markov": 19, "markupsaf": [], "mart": 0, "martin": [0, 6, 8], "martiro": 0, "marvin": [], "mask": [0, 2, 4, 13, 18, 22, 28], "maskgit": 0, "massachusett": [0, 17, 27], "massiv": 21, "masterpiec": 4, "match": [12, 19, 21, 26, 27, 28], "matena": [0, 18], "materi": 15, "mateusz": [], "math": [11, 19], "mathbb": 11, "mathbf": 11, "mathcal": [9, 11, 28], "mathemat": [19, 22], "mathew": [], "mathews1969technologi": [], "mathrm": 11, "mathur": [], "matplotlib": 16, "matrix": 12, "matt": 0, "matthew": 0, "matthia": [], "mauricio": 0, "mauro": [0, 5, 18], "max": [10, 24, 28], "max_length": [4, 24], "maxim": [19, 28], "mayb": 2, "mb": [], "mbl10": [], "mbqf21": [0, 8, 9, 18], "mbqf22": [18, 28], "mbqf22a": 0, "mbqf22b": [0, 16], "mc": 24, "mcaulei": [0, 5, 8, 9, 15, 17, 18], "mccann": [], "mcfee": [], "mckee": [0, 5], "mcleavei": [], "mcy": [], "mdit": [], "mdurl": [], "me": [24, 28], "me14": [], "mean": [0, 2, 5, 6, 8, 9, 16, 17, 19, 24, 26, 27, 28], "meaning": [6, 27, 28], "measur": [4, 6, 12, 24, 25, 26, 28], "mechan": [2, 4, 8, 14, 21, 25, 27, 28], "media": [4, 11], "median": 26, "medic": [], "medium": 2, "meet": [0, 6, 8, 26], "megan": [0, 18, 25], "mehri": [0, 17], "mehrish": [], "mei": 0, "meinard": [], "mel": [11, 14, 21], 
"melanchol": [24, 28], "melechovski": [0, 5], "melgan": [], "melod": 13, "melodi": [2, 4, 5, 13], "member": 15, "memcnn": [], "memo": [], "memori": [], "meng": [], "menghan": [], "mengji": [0, 6, 8], "mengy": [], "menick": [], "mention": 16, "mert": [4, 28], "mesmer": [4, 24], "meta_db": 24, "metadata": [2, 16, 23, 24, 27, 28], "metal": 28, "meteor": 6, "meter": [], "method": [0, 2, 3, 8, 9, 12, 14, 16, 17, 18, 19, 21, 22, 28], "methodologi": [3, 15, 17, 18], "metric": [0, 3, 12, 18, 24, 26], "metzler": [], "mexico": [0, 5, 8, 9], "mfmw24": [], "mgg": [0, 2, 5], "mha": [], "mi": [], "miccai": [], "micha": 0, "michael": [0, 18], "michal": [], "micha\u00ebl": [], "michel": 0, "michigan": 15, "micro": [], "midi": [], "midinet": [0, 13], "might": [2, 4, 13, 19, 28], "migneco": [0, 9], "mihir": [], "miika": [], "mike": [], "mikhail": [], "mildenhal": [], "miller": [0, 17], "million": [5, 19], "mimic": 8, "min": [0, 24], "ming": [0, 17, 18, 28], "mingbo": [], "minghui": [], "mingni": [], "minguk": [], "mingz": [], "mini": [19, 28], "minim": [11, 19, 28], "minimum": 28, "minor": 10, "minz": [0, 4, 5, 9, 18, 23, 24, 28], "mir": [2, 3, 4, 9, 15, 16, 24], "mir_ev": [], "mishkin": [0, 18, 28], "misinform": 20, "mislead": 12, "mismatch": 2, "miss": [0, 3, 23, 25, 26], "mission": 20, "mit": [5, 17], "mitsubishi": 15, "mitsufuji": [0, 6, 8], "mix": [6, 8], "mixtur": 8, "mixup": [0, 18], "mjxz23": [0, 13], "mkg": [0, 13, 17], "mlm": 21, "mlp": [8, 11], "mlvalimaki23": [], "mlx": [], "mm24": [0, 2], "mmm": [], "mo": [], "modal": [0, 5, 8, 12, 13, 15, 19], "mode": [2, 19], "model": [0, 2, 3, 5, 6, 7, 9, 10, 12, 13, 14, 15, 16, 17, 20, 23, 25, 27], "model_config": 10, "moder": 11, "modern": [11, 21, 23, 24], "modifi": [0, 8, 23], "modul": [2, 4, 8, 10, 11, 12, 14, 16, 24], "modulenotfounderror": [4, 10, 16, 24], "moham": [0, 17], "mohammad": [0, 17], "mojtaba": 0, "mokadi": [], "mold": 2, "molei": [], "molin": [], "monica": 0, "monitor": [0, 20], "mood": [2, 5, 9, 16, 17, 23, 27, 
28], "moon": [], "moonseok": [], "moor": [], "mor": [0, 17], "more": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 27, 28], "morgan": [], "morph": 2, "morton": [0, 9], "mosseri": [], "most": [2, 4, 7, 8, 10, 11, 12, 16, 17, 19, 21, 22, 24, 28], "mostafa": [0, 18], "mostli": 23, "motion": [], "motiv": 11, "move": [4, 25], "mpmath": [], "mpt": 5, "mqa": [6, 8, 9], "mrr": 26, "msci": 15, "msd": 28, "msdm": 2, "msn24": [0, 28], "mssr23": [0, 5], "mtc": [], "mtg": [5, 8], "mtp": [0, 2], "mtt": [4, 24], "mu": [5, 8], "mucap": [5, 8], "much": [6, 8, 9, 10, 11, 21, 23], "muchomus": [0, 6, 8], "muedit": [5, 8], "muhammad": [], "mul": 10, "mulab": [4, 15, 24], "mulan": [0, 16, 18, 28], "mulap": 16, "mullama": 9, "muller15": [], "multi": [0, 5, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 23, 25], "multiclass": 27, "multidict": [], "multilabel": [27, 28], "multimedia": [0, 17, 28], "multimod": [0, 3, 6, 9, 15, 18, 20, 28], "multipart": [], "multipl": [0, 6, 8, 9, 11, 16, 18, 19, 20, 21, 23, 25, 26, 27, 28], "multitask": [0, 4, 18], "multitrack": [0, 13], "murata": 0, "murtadha": [], "muscap": [0, 8, 9, 18], "muse": [], "musegan": [0, 13], "musemorphos": [], "music": [0, 3, 4, 6, 12, 13, 14, 16, 19, 21, 22, 23, 24, 25, 26, 27, 28], "musicbench": 5, "musiccap": [0, 4, 5, 8, 18, 24], "musiccaptioningmodel": 4, "musicfm": [4, 24], "musicfm_emb": [4, 24], "musicgen": 13, "musicgenerationtempl": [], "musichifi": 0, "musician": [2, 5], "musicinstruct": [5, 8], "musicldm": [0, 11, 13, 18], "musiclm": [0, 5, 13, 18], "musicmagu": [0, 2], "musicnet": 8, "musicqa": [5, 8], "musictextclip": [5, 8], "musictextdataset": [4, 24], "musicva": 13, "musika": [], "musilingo": [0, 5, 8, 9], "must": [4, 23, 25, 28], "mustango": [0, 2, 5], "mvdream": [], "mwd": [0, 5, 23], "mwpt18": [], "mwpt19": [0, 17], "my": [], "n": [0, 4, 5, 6, 8, 10, 11, 18, 22, 24, 28], "n_compon": 16, "n_step": 10, "naacl": [0, 5, 8, 9], "nabla_": 11, "naeem": [], "nah": [], "naik": 
[], "nal": [0, 17], "nam": [0, 8, 9, 15, 17, 18, 23, 25, 28], "namburi": [0, 5, 8, 9], "name": [4, 8, 10, 13, 15, 16, 24, 27, 28], "nameerror": 4, "nan": [0, 18], "nanci": [0, 6], "nanxin": [], "naoki": 0, "narang": [0, 18], "narrow": 23, "nash": 0, "natalia": [], "nataniel": [], "nathan": [0, 8], "nathana\u00e3": [], "nation": 20, "nativ": [], "nattanat": [], "natur": [0, 2, 4, 5, 7, 8, 9, 13, 17, 18, 19, 22, 23, 24, 25, 27, 28], "navercorp": 15, "navig": 7, "navonil": [0, 5], "na\u00efv": 21, "nc": [5, 24], "ncl": [0, 17, 18], "ncsoft": 15, "nd": 5, "nearli": 11, "necessari": [6, 26], "necessarili": [12, 19], "necessit": 20, "need": [0, 4, 8, 10, 12, 15, 16, 17, 19, 20, 21, 23, 24, 25, 26, 27], "neg": 26, "neil": [], "nessler": 0, "net": [0, 8, 11], "neteas": 8, "network": [0, 8, 9, 13, 14, 17, 18, 19, 21, 22, 28], "networkx": [], "neuraip": [], "neural": [0, 8, 9, 12, 13, 17, 18, 19, 21, 22, 28], "neurip": [0, 15], "neurocomput": [], "never": 28, "new": [0, 2, 4, 9, 11, 13, 14, 15, 17, 18, 19, 23, 24, 27, 28], "newcom": 3, "newer": [6, 8], "newman": [], "next": [4, 8, 9, 10, 11, 14, 15, 19, 21, 22], "nez": 0, "nezhurina": [0, 18], "nfeld": [], "nice": 11, "nichol": 0, "nichola": [0, 18], "nick": [], "nickson": 0, "nicola": [], "nie": [], "nielsen": 0, "nieto": [0, 28], "night": 24, "nikhil": [], "niki": 0, "nikita": [0, 8], "nikolau": [], "niru": [], "nistal": 0, "nlp": [3, 6], "nm": 24, "nmbkb24": [2, 18], "nmbkb24a": [0, 2, 11], "nmbkb24b": 0, "nn": [4, 24], "nniemi": [], "no_grad": [4, 16, 24], "noam": [0, 17, 18], "nois": 11, "noise2": [], "noisi": [11, 16, 28], "non": [8, 11, 13, 20], "none": 4, "nonequilibrium": [], "nonsens": 19, "norberto": [], "norm": [11, 16], "normal": [10, 11, 24, 26, 27], "norman": [], "norouzi": [0, 17], "notabl": [2, 11, 12, 13, 23], "notat": 11, "note": [0, 2, 4, 5, 8, 10, 11, 13, 17, 21], "notebook": [4, 10, 15, 24], "nou": [], "nouri": [], "nov": [], "novack": [0, 5, 8, 9, 15, 18], "novack2024prestod": [], "novel": [0, 3, 8, 
16, 18, 27, 28], "novelti": [0, 18], "novemb": 15, "now": [2, 4, 8, 9, 10, 11, 15, 19, 20, 21, 24], "np": [16, 24], "npa": [0, 2], "nsynth": [13, 17], "nuanc": [9, 16, 25, 28], "null": [], "num_epoch": [4, 24], "num_return_sequ": 4, "num_work": 24, "numba": [], "number": [2, 6, 10, 11, 12, 21, 27], "numel": [4, 24], "numer": 11, "numpi": [10, 16, 24], "nvp": [], "ny": [0, 8], "nzc": [0, 11], "o": [10, 16], "o1": 19, "oasi": 27, "object": [6, 12, 14, 19, 28], "obtain": [5, 6, 8, 9, 12, 15, 19], "occupi": 7, "occur": 11, "octob": [0, 6, 8], "od": 0, "off": [9, 11], "offer": [11, 12, 15, 18, 25, 28], "often": [2, 5, 8, 9, 12, 16, 19, 20, 22, 23, 28], "of\u00e2": [0, 8], "oh": [], "oi": 0, "olaf": [], "older": 2, "olivi": [], "olv18": [0, 28], "omer": [], "ommer": [], "omran": [], "onc": [10, 11, 19], "one": [2, 5, 8, 9, 11, 14, 16, 17, 18, 19, 20, 21, 22, 26, 27, 28], "ones": [4, 21], "ongo": 20, "onli": [2, 4, 5, 6, 8, 9, 11, 15, 17, 19, 21, 25, 26, 27, 28], "onlin": [15, 24], "onto": 21, "ontologi": [], "ontrol": [], "oor": 0, "oord": [0, 17, 28], "oov": 27, "open": [0, 2, 6, 11, 19, 20, 23, 27, 28], "openai": [0, 13, 15, 18, 19], "openli": 19, "openmu": [0, 6, 8], "openreview": [0, 8], "oper": [2, 3, 11, 23, 27], "opera": [4, 24], "operat": 4, "operatornam": 8, "opportun": [23, 25], "optim": [0, 2, 4, 18, 19, 22, 24, 28], "option": [8, 10, 19], "orama": [0, 28], "oran": [], "orchestr": 4, "orchestra": 4, "order": [4, 8, 9, 11, 21], "org": [0, 5, 6, 8, 9], "organ": [0, 8], "organis": 7, "orient": [], "origin": [2, 4, 6, 11, 12, 13, 19, 21, 24], "orio": [], "oriol": [0, 17, 28], "orjson": [], "orthogon": [16, 19], "oscar": [0, 17], "other": [0, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 16, 17, 18, 19, 20, 21, 22, 23, 28], "otherwis": 19, "our": [3, 8, 10, 11, 15, 19, 27], "out": [2, 3, 4, 8, 10, 11, 16, 18, 21, 24, 28], "outlier": 26, "output": [2, 4, 6, 8, 9, 10, 11, 12, 13, 17, 21, 22, 27, 28], "outsid": [15, 27], "ouyang": [0, 18], "over": [8, 11, 17, 21, 22, 25, 28], 
"overal": [2, 4, 6, 7, 8, 10, 11, 12, 19, 26], "overcom": [3, 6, 8], "overhead": 27, "overlap": 6, "overli": 21, "overview": [3, 8, 15, 22], "owj": [0, 18], "own": [4, 20], "p": [0, 4, 8, 9, 11, 17, 22, 24], "p310": [10, 15], "p_": [8, 11, 22], "p_0": 11, "p_1": 6, "p_i": [], "p_n": 6, "p_t": 11, "pablo": 0, "pachet": 0, "packag": [4, 10, 24], "pad": [4, 10, 24], "pad_token": 4, "pad_token_id": 4, "page": [3, 15], "pai": 11, "pair": [2, 4, 5, 7, 8, 12, 16, 19, 21, 28], "palett": [], "pamela": [0, 18, 28], "pan": [], "panda": 5, "pandei": [], "pandora": 15, "panel": [], "pann": [0, 12], "panorama": [], "paper": [0, 6, 8, 15, 16, 19, 21], "paradigm": [4, 7, 8, 11, 13], "paragraph": 19, "parallel": [10, 15, 21], "param": [4, 24], "paramet": [4, 8, 10, 11, 19, 22, 24, 28], "parameter": 11, "pardo": 0, "pareto": 19, "pari": [], "parikh": 0, "park": [0, 18, 28], "parker": 0, "parma": [4, 24], "parmaet": 22, "parmar": 0, "pars": [6, 11], "parser": [], "parso": [], "part": [4, 6, 7, 8, 11, 14, 17, 19, 21, 22, 24, 28], "partial": [6, 7, 10], "particip": [12, 15], "particular": [2, 4, 15], "particularli": [2, 5, 7, 8, 12, 13, 15, 17, 26, 27, 28], "partit": [], "partli": 21, "pasini": 0, "pass": [4, 8, 11], "passion": 15, "past": 21, "patashnik": [], "patch": 11, "patchifi": 11, "path": [5, 11], "pathak": [], "pathtool": [], "patrick": [0, 9], "pattern": [0, 5, 17, 19, 23, 26], "paul": [0, 17], "pauli": [], "pave": [13, 15], "pavlov": [], "payn": [0, 18], "pbar": [4, 24], "pcws22": [], "pd": 5, "peak": 10, "pedalboard": [], "peebl": 0, "peeter": [0, 9], "peizhao": [], "penalti": 6, "peng": [], "pengi": [0, 8], "peopl": [16, 23, 28], "per": [11, 19, 24, 28], "perceiv": [12, 28], "percept": 12, "perci": 0, "percuss": 15, "pereira": [0, 25], "perez": [], "perfect": [19, 24, 26], "perfecto": [0, 9], "perform": [0, 4, 8, 9, 11, 12, 13, 15, 16, 19, 24, 25, 26, 28], "perhap": [11, 23], "period": 2, "perplex": 16, "perraudin": [0, 8], "person": [], "personalis": 7, "perspect": [8, 
12], "pertin": 18, "peter": [0, 18, 28], "pexpect": [], "pgpf23": [], "pgxh23": [], "ph": 15, "phase": [17, 18], "phbd03": [0, 9], "phd": [0, 15, 27], "phil": [], "philip": [0, 5, 23], "philipp": 0, "phillip": [], "philosophi": 8, "photorealist": [], "photoshopgenerativefil": [], "phrase": [16, 28], "physic": 15, "piano": [0, 15, 16, 28], "pianotre": 0, "pianotreeva": 13, "pick": [10, 19], "piec": [4, 9, 12, 17, 24, 26, 27, 28], "pierr": [], "pieter": 0, "pietquin": [], "pillow": [], "ping": [0, 8], "pink": 13, "pinkl": [0, 8], "pip": [4, 10, 15], "pipelin": 8, "pitch": [0, 13, 17], "pixel": 19, "pjbm22": [], "plai": [4, 10, 19], "plakal": [], "plan": 20, "platformdir": [], "platt": [], "play": 5, "playback": 7, "playground": [], "playlist": [0, 8, 9, 18, 25], "playntel": 8, "pleas": [3, 17], "plot": 19, "plotli": [], "plt": 16, "plugin": [], "plumblei": 0, "pmlr": [0, 17, 28], "point": [6, 11, 12, 19, 26], "polici": 19, "polit": 20, "polosukhin": 0, "polyak": [0, 17], "polyffus": 0, "polyfuss": 13, "polyphon": 0, "pon": 0, "pooch": [], "pool": 0, "poor": [8, 12], "pop": [0, 24, 27], "popular": [4, 7, 8, 19], "poria": [0, 5], "posit": [5, 7, 16, 21, 26, 28], "possess": 16, "possibl": [3, 6, 7, 8, 9, 10, 12, 13, 19, 21, 27, 28], "possibli": 5, "post": [8, 11, 19], "post0": [], "post1": [], "post2": [], "posterior": [], "postolach": 0, "potenti": [8, 12, 15, 18, 20], "power": [0, 4, 5, 8, 16, 19, 24, 28], "pp27": [], "pp32": [], "pp33": [], "ppo": 19, "prabhudesai": [], "practic": [2, 3, 7, 8, 11, 15, 18, 19, 21, 22, 26, 28], "practition": [26, 28], "prafulla": [0, 18], "pre": [0, 4, 5, 6, 8, 9, 11, 16, 18, 19, 27], "preced": [14, 21, 28], "precis": [6, 8, 15, 19], "predefin": [6, 9, 26, 28], "predict": [0, 9, 11, 13, 14, 17, 19, 21, 26, 27, 28], "predominantli": 23, "preechakul": [], "prefer": [0, 15, 18, 19, 23, 25], "prefigur": [], "prefix": [0, 4, 8, 11, 14, 19, 22], "prefix_length": 4, "prefix_mask": 4, "prefix_project": 4, "prem": 0, "prepar": 3, "preprint": 
[0, 5, 6, 8, 9, 17, 18, 23, 25, 28], "preprocess": 14, "present": [7, 9, 11, 12, 15, 18, 23, 25, 28], "preserv": 4, "press": 0, "presto": [0, 15], "pretext": 19, "pretrain": [0, 9, 12, 14, 15, 16, 19, 24, 28], "pretti": 21, "prevent": 20, "previou": [8, 19, 21, 23, 25, 28], "previous": [11, 15, 25, 27], "primari": 9, "primarili": [13, 15, 17, 23, 28], "principl": [4, 13], "print": [4, 24], "prior": [2, 9], "pritch": [], "privat": 8, "pro": 24, "probabilist": 0, "probabl": [0, 8, 11, 13, 14, 19, 21, 22], "probe": 19, "problem": [16, 17, 18, 19, 20, 21, 23, 28], "problemat": 27, "proc": [0, 9], "proccess": 11, "proce": 11, "procedur": [], "proceed": [0, 6, 8, 9, 11, 18, 25], "process": [0, 2, 4, 5, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19, 21, 22, 23, 25, 26, 27, 28], "prod_": [8, 9], "produc": [4, 5, 7, 8, 9, 19], "product": [0, 12, 15, 18, 19, 28], "program": [], "progress": [4, 5, 18, 19, 20], "progressbar": [], "proj": 11, "project": [4, 8, 11, 13, 21, 24, 28], "promin": [3, 24], "promis": [8, 19, 20, 25], "prompt": [0, 5, 6, 10, 11, 18, 19], "prone": 19, "pronounc": 6, "propag": [0, 17], "propcach": [], "properti": [4, 11, 24, 28], "proport": [19, 26], "proportion": 27, "propos": [9, 28], "protobuf": [], "protocol": 6, "proven": 22, "provid": [3, 5, 6, 7, 9, 10, 11, 12, 15, 19, 20, 21, 22, 23, 26, 27, 28], "proxim": 19, "pschluter22": [], "psdv": [], "pseudo": [0, 8, 16, 18], "psk": [0, 2], "psutil": [], "psycholog": [], "pt": [4, 24], "ptyprocess": [], "public": 10, "publish": [0, 15, 23], "puckett": [0, 17], "puhrsch": [], "pull": 2, "pumarola": [], "pure": [11, 18, 19, 25, 28], "purpos": [7, 8, 12, 13, 21], "push": 28, "put": [2, 4, 21], "pw": [0, 18, 28], "px23": [0, 11], "py": [4, 10, 24], "py2": [], "py3": [], "pycpars": [], "pycr": [], "pydant": [], "pydantic_cor": [], "pydub": [], "pygment": [], "pyloudnorm": [], "pynndesc": [], "pypars": [], "pyplot": 16, "pystoi": [], "python": [4, 15, 24], "python3": [4, 10, 24], "python_multipart": [], "pythonhost": 
[], "pytorch": [4, 24], "pytorch_lightn": [], "pytz": [], "pyviz": [], "pyviz_comm": [], "pywavelet": [], "pyyaml": [], "pzh": [], "q": [8, 14, 26], "qa": [], "qi": [0, 9], "qian": [0, 6, 8], "qiao": 0, "qifeng": [], "qing": [], "qingq": [0, 5, 18, 28], "qiu": [0, 8], "qiuqiang": 0, "qiyang": [], "quad": 11, "quadrat": 11, "qualit": 23, "qualiti": [0, 6, 13, 17, 18, 19, 26], "quantiz": [11, 14, 19], "queen": 15, "queri": [0, 5, 7, 8, 9, 14, 16, 18, 19, 24, 25, 26, 27], "query_vector": 24, "question": [0, 3, 5, 6, 8, 11, 19, 22], "quick": [10, 19], "quickli": [4, 19, 26], "quinton": [0, 6, 8, 9, 15, 18, 28], "quit": [2, 11], "qun": [], "quoc": [0, 18], "quot": 2, "qwen": [0, 8], "r": [0, 11, 15, 28], "r_1": 6, "rachel": [0, 8, 9, 18], "racial": 20, "radford": [0, 4, 18, 28], "radio": 28, "radiohead": 28, "radlinski": [0, 18, 25], "rafael": [0, 8], "raffel": [0, 18], "rag": 20, "rai": [0, 18], "ram": 0, "ramalingam": [], "ramesh": [0, 28], "ramsauer": 0, "randn": 10, "random": [4, 24], "random_st": 16, "randomnam": [], "rang": [2, 4, 6, 7, 8, 17, 18, 19, 22, 23, 24, 26, 27, 28], "rank": [19, 26, 27], "rank_q": 26, "rao": [], "rap": 16, "rapha": [], "raphael": [], "rapid": 15, "rapidli": 13, "raquel": [], "rare": [2, 21, 26, 28], "rashindra": [], "rate": [4, 10, 11, 12, 19, 24], "rather": [4, 8, 11, 12, 16, 17, 23, 27, 28], "rave": [0, 2], "ravi": [0, 18, 25, 28], "raw": [0, 11, 17, 24, 27, 28], "raymond": [0, 9], "rbl": [], "rdn": [], "re": [0, 3, 4, 7, 10, 11, 17, 22, 23, 24], "reach": 3, "reaction": 23, "readabl": 7, "reader": 11, "readi": 24, "real": [0, 2, 9, 12, 13, 15, 17, 19, 20, 23, 26, 27, 28], "realis": 8, "realist": [], "realiti": 16, "realiz": 19, "realli": [11, 19, 21, 22], "realm": [4, 24], "reaon": 12, "rearrang": [10, 21], "reason": [2, 6, 9, 11, 20], "recal": 6, "receiv": [8, 19], "recent": [4, 6, 7, 8, 9, 10, 13, 15, 16, 19, 20, 21, 22, 23, 24, 25, 28], "reciproc": 26, "recogn": [12, 19, 27], "recognis": [], "recognit": [0, 5, 9, 17, 27], 
"recommend": [0, 4, 5, 7, 10, 12, 17, 23, 25], "reconstruct": 14, "record": [], "recurr": [0, 9, 13, 22], "red": 21, "reduc": [6, 19, 20, 28], "refer": [0, 2, 11, 15, 19], "referenc": [], "refin": [23, 25], "reflect": [0, 6, 9, 12, 13, 19, 26, 28], "refram": 8, "regard": [7, 19, 27], "regardless": 28, "regener": 2, "regex": [], "regina": [], "region": [2, 11], "regress": [14, 18, 19], "regular": [11, 26], "reinforc": 19, "reinvent": 11, "reiss": [], "rel": [8, 9, 16, 17], "relat": [15, 16, 19, 20, 23, 27, 28], "relationship": [6, 9, 17, 21, 25, 28], "relax": 5, "releas": 21, "relev": [11, 15, 18, 19, 24, 25, 26, 27, 28], "reli": [6, 8, 12, 13], "reliabl": [6, 26], "relianc": 28, "religi": 20, "relu": [4, 24], "remain": [8, 11, 16, 20, 23, 25, 28], "remark": 15, "remedi": 19, "remez": [0, 18], "remi": 13, "remind": 28, "remov": 11, "ren": [], "render": 10, "renum": 5, "renumics___song": [], "rep": [], "repeat": 5, "repeatedli": 23, "repetition_penalti": 4, "replac": [11, 19], "repo": [], "report": [0, 16, 18], "repositori": 15, "repres": [7, 12, 13, 16, 17, 19, 21, 25, 28], "represent": [0, 4, 6, 8, 12, 15, 17, 19, 28], "repurpos": 2, "request": [16, 23], "requir": [2, 7, 8, 11, 12, 15, 19, 20, 23, 25, 26, 27, 28], "requires_grad": [4, 24], "requires_grad_": 4, "rer": [0, 13], "resampi": [], "research": [0, 3, 9, 12, 13, 15, 17, 18, 19, 20, 21, 22, 23, 25, 26, 28], "reshap": 4, "residu": [11, 14], "resize_token_embed": 4, "resnick": [0, 17], "reso": [], "resolut": [2, 11, 14], "resort": 21, "resourc": [7, 10, 19], "respect": [7, 8, 9, 28], "respons": [0, 5, 6, 8, 9, 12, 16, 19, 22, 23, 25], "respos": 8, "rest": [10, 11], "restrict": [8, 26, 27, 28], "result": [6, 8, 11, 12, 17, 19, 23, 26, 27, 28], "retain": 8, "rethink": [], "retriev": [0, 3, 6, 7, 8, 9, 13, 15, 16, 22, 26, 28], "retrieval_fn": 24, "return": [4, 5, 24, 26, 27], "return_tensor": [4, 24], "reus": 19, "reveal": 23, "revers": [0, 11], "review": [0, 3, 5, 6, 7, 8, 9, 11, 16, 18, 23, 28], "revisit": 23, 
"reward": 19, "rewon": [0, 18], "rez": 0, "rfb15": [], "rfer": [], "rg": [], "rgy": [0, 8, 9, 18, 28], "rhythm": [0, 2, 5], "rhythmic": 2, "ricardo": [], "riccardo": [], "rich": [16, 28], "richard": [], "richardson": [0, 9], "richer": [17, 19, 23, 28], "rif": [0, 13], "riff": [0, 2, 5, 24], "riffus": [0, 13], "rigel": [], "right": [5, 8, 11, 20], "rightarrow": [9, 11], "rinon": [], "rise": 17, "risk": 20, "rita": [0, 8], "rithesh": [0, 17], "ritter": [], "rkh": [0, 16, 28], "rkx": [], "rlhf": 19, "rlj": [], "rm": 19, "rmh": [], "rn": [], "rnn": [0, 8, 13, 21, 22], "rob92": [], "robbin": [], "robert": [0, 5, 17, 18], "roberta": [0, 21, 24, 28], "roberta_emb": 24, "robin": [], "roblek": 0, "robust": [25, 26, 28], "robustli": [0, 24, 28], "rock": [0, 2, 4, 5, 16, 17, 18, 24, 27, 28], "rod": [], "rodol": 0, "roform": [], "roger": [0, 8], "rohan": [0, 8], "role": 18, "roll": 5, "romain": [0, 8, 9], "rombach": [], "ron": 0, "rongchen": [0, 5, 8, 9], "rongji": [], "ronneberg": [], "room": 2, "root": [4, 24], "roshan": [], "rot92": [], "rotari": [], "rothstein": [], "roug": 6, "rouge_l": [], "rough": 11, "round": [6, 10], "rout": 8, "roux": 0, "rovan1997igm": 0, "row": 21, "royalti": [], "rpd": [], "rpg": [], "rsr": [0, 18], "rubinstein": [], "ruff": [], "rui": [], "ruibin": [0, 8], "ruihan": 0, "ruiz": [], "rule": 0, "run": [2, 10, 11, 15], "runner": [], "runtim": [4, 24], "runtimeerror": [], "russel": [0, 5], "rvq": 14, "rvqgan": 0, "rwc": [0, 18], "rwd97": [0, 13], "rxl": [], "s3f": [], "s4": 13, "s41592": [], "s_": 11, "sa": 5, "sabet": [], "sabour": [], "sadeep": [], "safehttpx": [], "safer": 20, "safetensor": [], "sageev": 0, "saharia": [], "sai": [8, 10, 21, 23], "sain": 0, "saito": [], "sakkeer": [0, 5, 8], "sal": [], "salamon": [0, 5, 28], "salient": [2, 8, 15], "saliman": 0, "salmonn": [0, 8], "sam": [0, 18], "same": [8, 11, 12, 16, 19, 20, 21, 28], "sameer": 0, "sami": [], "sampl": [2, 4, 8, 10, 11, 12, 19], "sample_r": 10, "sample_s": 10, "sampler": 10, 
"sampler_typ": 10, "samplernn": [0, 13, 17], "samuli": [], "san": 15, "sanakoyeu": [], "sander": [0, 11, 17], "sandhini": [0, 18, 28], "sandler": [0, 8, 9], "sang": [], "sanja": [], "sanjiv": [], "saroufim": [], "sashimi": 13, "sastri": [0, 28], "satisfact": [16, 26], "satisfi": [23, 26], "sauer": [], "saurabh": [], "saurou": [], "save": [10, 19], "savitzki": [], "saw": 19, "saxena": [], "saxophon": 28, "sbd": [], "sbr22": [], "sc": [], "scalabl": 0, "scale": [0, 8, 9, 10, 11, 12, 18, 28], "scatter": 16, "scc": [], "scdbk24": [0, 8], "scenario": [15, 23, 26, 28], "scene": [6, 11], "sch": [], "schedul": [], "schelten": [], "scheme": [8, 21], "schl": [], "schmidt": [0, 9], "schneider": [], "schoenfeld": [], "schulman": [], "schuster": [], "scienc": [0, 15, 23], "scientif": [], "scikit": [], "scikit_imag": [], "scikit_learn": [], "scipi": [], "scope": [16, 26, 28], "score": [0, 6, 8, 11, 19, 23, 26, 27], "scoroda18": [], "scott": [0, 9], "scratch": [2, 19, 23], "sd": 11, "sdcs23": [], "sdd": 5, "sde": [10, 11], "sdk": [], "sdwmg15": [], "search": [7, 15, 17, 18, 19, 23, 24, 25, 27, 28], "seb": [], "sebastian": [], "sec": [], "second": [0, 4, 8, 10, 11, 12, 17, 19, 21, 24, 27], "seconds_start": 10, "seconds_tot": 10, "secret": 10, "section": [4, 8, 9, 13, 14, 16, 19, 21, 22, 28], "see": [4, 6, 7, 8, 9, 11, 19, 21, 22], "seed": 10, "seek": [0, 2, 9, 11, 15, 23], "seem": 11, "seen": [2, 4, 8, 16, 19, 28], "seetharaman": 0, "segment": [9, 21, 22], "select": [12, 28], "self": [0, 4, 8, 10, 11, 15, 16, 19, 24, 28], "semant": [0, 6, 16, 17, 18, 19, 27, 28], "semantic_vers": [], "semanticscholar": 0, "semi": [0, 9], "senior": [0, 17], "sens": [11, 19, 21, 22, 24], "sensit": [12, 26], "sentenc": [6, 8, 9, 13, 16, 18, 21], "sentence_transform": 16, "sentencepiec": 21, "sentencetransform": 16, "sentri": [], "sentry_sdk": [], "seong": [], "separ": [0, 8, 11, 15, 19, 21, 27], "sepp": 0, "sequenc": [0, 8, 9, 10, 11, 19, 22, 28], "sequenti": [0, 4, 8, 24], "sergei": [], "sergio": 
[0, 28], "seri": [8, 11, 15], "serra": [0, 9, 28], "serv": [12, 14, 15, 16, 23, 25], "server": 15, "session": [0, 4, 8, 12], "set": [0, 5, 6, 8, 9, 10, 12, 18, 22, 25, 26, 27], "seth": 0, "setproctitl": [], "setup": 12, "setuptool": [], "seungheon": [0, 3, 5, 8, 15, 18, 23, 25, 28], "seungheond": 10, "seungjun": [], "seventh": [0, 8], "sever": [6, 7, 8, 9, 12, 19, 23, 25, 26, 28], "sexual": 20, "seybold": [], "seyedhosseini": 0, "sfg": [], "sfjb21": [], "sfk24": [], "sft": 19, "sg64": [], "sgz": [0, 12], "sh22": [], "shan": [0, 5, 8], "shang": [], "shansong": [0, 5, 8], "shaohan": [0, 28], "shaoq": [], "shaoshu": [], "shaozh": [], "shape": [4, 24, 28], "sharan": [0, 18], "share": [6, 17, 28], "sharifi": 0, "shawn": [], "shayn": [0, 18], "shazeer": [0, 18], "she": [4, 15], "shechtman": [], "sheld": 11, "shelf": 11, "shellingham": [], "shen": [], "sheng": [], "shengfeng": [], "sherlock": [], "shi": [], "shibuya": [], "shift": [7, 9, 11, 17, 21, 28], "shih": [0, 18], "shihao": [], "shiliang": [0, 8], "shinji": [0, 18], "ship": 19, "shiqi": [0, 6, 8], "shiran": [], "shivam": [], "shjl24": [], "shkk22": [], "shlomo": [0, 8, 9, 18, 28], "short": [0, 4, 9, 17], "shortcom": [6, 7], "shortli": 8, "shot": [0, 8, 9, 18, 28], "should": [4, 9, 10, 11, 15, 19, 25, 26], "show": [6, 16, 19, 26, 27, 28], "shown": [5, 8, 13, 19, 20, 21, 28], "shrirao": [], "shu": [0, 18, 25], "shuai": [0, 28], "shubham": [0, 17], "shuffl": [4, 24], "shunt": [], "shuo": [0, 6], "shuqi": [], "shusuk": [0, 6, 8], "shutterstock": 8, "shyamal": [0, 18], "si": [], "siang": 0, "sicong": [], "siddhartha": [0, 18], "side": 2, "siggraph": [], "sigir": [0, 18, 25], "sigma_max": 10, "sigma_min": 10, "signal": [0, 4, 5, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 28], "signatur": 13, "signifi": 15, "signific": [15, 23, 25, 27, 28], "significantli": [19, 28], "sil": [], "silenc": 10, "sim": 11, "simian": [], "similar": [0, 2, 6, 8, 11, 12, 17, 19, 21, 23, 24, 25, 26, 27], "similarity_metr": 24, "similarli": [6, 8, 28], 
"simon": [0, 8, 9, 18], "simonetta": [], "simonyan": [0, 17], "simpl": [0, 2, 4, 9, 13, 18, 19, 21, 25, 28], "simple_contrastive_loss": 24, "simpler": [10, 11, 21], "simplest": [2, 8, 9], "simpli": [2, 4, 8, 10, 11, 19, 22], "simplifi": [], "simultan": [0, 13, 16, 28], "sinc": [8, 11, 12, 13, 19, 21, 27], "sing": [0, 4, 15, 24], "singapor": [0, 8], "singer": [0, 4, 5, 24], "singh": [0, 8], "singl": [0, 8, 9, 10, 11, 12, 16, 17, 18, 26, 28], "singsong": [0, 2], "siraichi": [], "site": 10, "situat": [12, 19, 22], "sivic": [0, 5], "six": [], "siyu": [], "siyuan": [], "size": [4, 5, 8, 9, 11, 19, 21, 24, 28], "sjscholkopf23": [], "sk": [], "skals": [], "sketch": 8, "sketchnet": [0, 13], "skip": 19, "skip_special_token": 4, "sklearn": 16, "skoglund": [], "skyrocket": 4, "slama": [0, 18], "slbr23": [], "slc07": [0, 17], "slightli": [8, 21, 23], "slow": [19, 24], "slowdown": 11, "small": [4, 9, 11, 12, 19, 21, 26], "smaller": [19, 28], "sme20": [], "smith": [], "smitin": [0, 2], "smmap": [], "smollei": [], "smooth": 28, "snake": 11, "sne": 16, "sniffio": [], "so": [2, 4, 6, 8, 10, 11, 15, 19, 21, 22], "so17": [0, 13], "soar": 4, "social": [0, 16, 17], "societi": [0, 6, 8, 9, 15, 18, 20, 23], "soft": [5, 16, 28], "softmax": [8, 27], "softwar": 15, "soham": [0, 8], "sohl": 0, "sojoudi": [], "solid": 15, "solo": [4, 5, 28], "solut": [3, 18, 19], "solv": [11, 16, 19, 20, 21], "solver": 11, "somayeh": [], "some": [2, 4, 5, 6, 7, 8, 9, 10, 11, 13, 17, 19, 21, 22], "someth": [11, 19], "sometim": [8, 9, 19], "son": [], "song": [0, 2, 4, 5, 8, 9, 10, 23, 24, 27, 28], "soni": [13, 15], "soon": 13, "sophist": [3, 4, 8, 25, 28], "soprano": 4, "sora": 19, "sordo": [0, 17], "soroush": [0, 17], "sort": [2, 11, 26], "sot": 21, "sotelo": [0, 17], "soujanya": [0, 5], "soumith": [], "sound": [0, 4, 5, 8, 9, 11, 16, 17, 18, 24, 27, 28], "soundctm": [], "soundfil": [], "soundstorm": [], "soundstream": [], "sourc": [0, 5, 15, 17, 19], "sourcetensor": 4, "sourish": [], "southern": 15, "souza": 
[], "space": [0, 3, 4, 8, 11, 13, 16, 19, 23, 28], "spam": 20, "span": [7, 15, 18], "spanish": 24, "speak": 11, "special": [16, 19, 21, 25, 28], "specialis": 8, "specif": [2, 3, 4, 6, 8, 9, 12, 13, 15, 17, 18, 19, 21, 22, 27, 28], "specifi": 23, "speck": [0, 9], "spectral": 11, "spectrogram": [0, 11, 14, 21, 28], "spectrum": [13, 23], "speech": [0, 5, 8, 9, 12, 13, 17, 18, 19, 27, 28], "spend": 19, "spice": 6, "spider": 6, "spijkervet": 0, "spirit": 11, "spl": 0, "split": [4, 24], "spm": 15, "spotifi": 15, "springer": [0, 8, 28], "sqrt": 8, "squar": 21, "sr": 10, "src": 5, "srikumar": [0, 6, 8], "srivatsan": [0, 8], "ssdk": [0, 11], "ssw": 0, "stabil": [10, 15], "stabilityai": 10, "stabl": [0, 2, 11, 16], "stable_audio_tool": 10, "stableaudio": 13, "stack": 11, "staff": 15, "stage": [4, 8, 19], "standard": [2, 4, 6, 8, 11, 21, 22, 26], "stanlei": [], "starlett": [], "start": [6, 8, 9, 10, 11, 13, 17, 21, 22, 23], "startup": 13, "stasyuk": [], "state": [0, 2, 8, 9, 13], "static": [6, 9, 15, 19, 28], "statist": 12, "steadi": 20, "steer": 23, "steerabl": 0, "stefan": 0, "stefano": 0, "steinmetz": [], "stem": [2, 28], "stemgen": [0, 2], "step": [0, 6, 8, 10, 11, 12, 14, 19, 20, 22, 26], "stephen": [0, 17], "stereo": 0, "steven": [0, 5, 8, 9, 18], "stft": [11, 14], "still": [8, 12, 16, 19, 20, 21, 23], "stimulu": 12, "stitch": [], "stochast": [0, 11], "stoi": [], "stoller": [0, 8, 9, 18], "stop": 10, "store": [], "stori": 24, "straight": 19, "straightforward": [12, 21], "strategi": [0, 6, 18, 28], "strength": [11, 12], "strictli": 10, "string": [4, 11, 24], "strong": [2, 20, 24], "stronger": 28, "strongli": [2, 6], "strub": [], "struct": 7, "structur": [0, 2, 9, 13, 19, 23, 27, 28], "struggl": [19, 23, 28], "strum": 24, "student": [15, 19], "studi": [3, 6, 9, 11, 15, 17, 23, 27, 28], "style": [2, 9, 11, 13, 17, 23, 28], "su": [], "sub": [9, 21, 22], "subject": [0, 5, 6, 12], "sublinear": [], "submit": 19, "subscript": 22, "subsequ": [6, 13, 23], "subset": [4, 21, 24, 
27], "substanti": [15, 28], "substr": 22, "subtl": [12, 28], "subword": [21, 28], "succeed": 3, "succes": 11, "success": [8, 21, 22, 25], "suffici": [19, 25, 27, 28], "suggest": [4, 8, 10, 23, 25], "suha": [], "suhail": [], "suit": 6, "suitabl": [5, 8, 9, 12], "suk": [], "sum": [4, 24], "sum_": [8, 26, 28], "sumbali": [], "summar": [12, 19], "summari": [7, 9], "summaris": 6, "summit": 20, "sun": [0, 5, 6, 8, 13], "sungroh": [], "sunni": [5, 28], "suno": [0, 13], "suo": [], "supasorn": [], "super": [4, 24], "superior": 12, "supervis": [0, 8, 9, 15, 17, 18, 19, 21, 27, 28], "supplement": [5, 15, 24], "suppli": 21, "support": [7, 10, 17, 20, 23, 24, 25], "suppos": 21, "sure": [4, 10, 24], "surround": [19, 21, 22, 28], "survei": [0, 15, 17], "surya": [], "sustain": 25, "sutskev": [0, 18], "suttisak": [], "suwajanakorn": [], "svn37": [], "swap": [0, 28], "swave": [], "sweep": 10, "sweet": 3, "swerski": [], "swiss": [0, 6, 8], "swy": [], "sylvain": [], "symbol": [0, 13], "sympi": [], "synchron": [0, 18], "synnaev": [0, 18], "syntact": 6, "synth": 5, "synthes": 13, "synthesi": [0, 13, 17], "synthet": [0, 7, 25], "system": [0, 2, 4, 6, 7, 8, 9, 10, 12, 16, 17, 18, 19, 23, 24, 26, 27, 28], "szk": [], "szu": [0, 17, 18], "t": [0, 2, 3, 5, 6, 8, 9, 10, 11, 16, 18, 19, 21, 23, 24, 27, 28], "t1": 10, "t5": [11, 14, 16, 19, 21, 22], "t_i": 28, "t_j": 28, "tabl": [8, 19, 21], "tackl": [8, 28], "taehong": [], "taesu": [0, 23, 25], "taesung": [], "tag": [0, 2, 4, 5, 9, 16, 17, 18, 24, 27], "tagliasacchi": [0, 5, 18], "tai": [0, 18], "taigman": [0, 17], "tak": 8, "takahashi": [0, 6, 8], "takashi": [], "take": [2, 4, 6, 7, 8, 11, 14, 19, 20, 27, 28], "takida": 0, "tal": [0, 18], "talent": [4, 24], "tali": [], "talk": [0, 2, 7, 19, 25], "tallini": 0, "tan": [0, 8], "tang": [0, 8], "tanh": 8, "tao": [0, 8], "tar": [], "target": [8, 11, 14, 27], "task": [0, 3, 4, 6, 7, 8, 11, 12, 13, 14, 15, 18, 20, 21, 26, 27], "taslp": 15, "tat": [], "tau": 28, "taylor": [0, 8, 15, 18, 28], "tb": 10, 
"tb_name": 10, "tbtl08": [0, 17, 18, 27], "tc02": [0, 9], "te_dataload": [4, 24], "tea": 5, "teach": [0, 15, 17, 18], "teacher": 19, "teboul": [], "tech": [], "technic": [0, 15, 18, 22, 23], "techniqu": [0, 8, 12, 15, 18, 21], "technologi": [0, 15, 17, 18, 23, 27], "teh": 2, "tejasvi": [0, 18, 25], "telecommun": [0, 12], "tell": 24, "temperatur": [4, 24], "templat": 5, "tempo": [5, 16, 17, 24, 27], "tempor": [0, 2, 4, 5, 8, 9, 11], "ten": 19, "tenac": [], "tencent": 15, "tend": [2, 20, 23], "tendenc": 26, "tenenbaum": [], "tensor": [4, 24], "tensor_numpi": [], "tensorboard": [], "tensorboard_data_serv": [], "teoh": [], "ter": [], "term": [0, 2, 6, 7, 8, 9, 11, 12, 16, 17, 18, 19, 27, 28], "termcolor": [], "tero": [], "test": [4, 19, 20, 24], "test_dataset": [4, 24], "tester": 12, "teuwen": [], "text": [0, 3, 4, 6, 7, 8, 9, 10, 13, 14, 15, 16, 18, 19, 22, 23, 24, 26, 27], "text2song": 2, "text_embedding_dim": [4, 24], "text_encod": 24, "text_forward": 24, "text_model": 4, "text_output": 24, "text_project": 24, "text_token": [4, 24], "textrm": [11, 22], "textsubscript": 6, "textual": [0, 2, 12, 13, 18, 28], "textur": 23, "textwrap": 4, "tf": 6, "th20": [], "thabet": [], "thabo": [], "thailand": [0, 6, 8], "than": [2, 4, 8, 9, 11, 12, 15, 16, 17, 19, 20, 23, 27, 28], "thang": [], "thank": [16, 28], "thdl24": [0, 13], "thei": [4, 6, 7, 8, 9, 11, 15, 16, 19, 21, 22, 23, 25, 28], "them": [2, 4, 6, 8, 10, 11, 12, 13, 19, 21, 28], "theme": [17, 28], "theori": 8, "therebi": 15, "therefor": [8, 9, 12], "thereof": 8, "theres": 11, "thermodynam": [], "thesi": [0, 15, 27], "theta": [8, 11, 22], "the\u00e2": [0, 8], "thi": [2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28], "thibault": [], "thickstun": 0, "thierri": [0, 17], "thing": [2, 4, 10, 11, 20, 24], "think": [0, 4, 8, 9, 11, 19, 28], "third": [0, 8], "thirti": [0, 8], "thoma": 0, "those": [8, 9, 19, 20, 21, 22], "though": [8, 9], "thread": 8, "threadpoolctl": [], "three": 
[5, 12, 13, 19, 21, 23], "threshold": 26, "through": [0, 2, 3, 4, 5, 7, 8, 9, 10, 13, 15, 16, 17, 18, 19, 20, 23, 24, 25, 27, 28], "throughout": 5, "throughput": 20, "thu": [11, 15], "ti": 8, "tian": [0, 8], "tianqi": [], "tianwei": [], "tianxiang": [0, 8], "tianyu": [0, 28], "tie": [0, 9], "tier": 15, "tifffil": [], "tight_layout": 16, "tillet": [], "tim": 0, "timbr": [2, 16], "timbretron": [], "time": [0, 2, 8, 9, 10, 11, 12, 13, 14, 17, 18, 19, 20, 21, 25, 27, 28], "timeless": 4, "timelin": 13, "timescal": [8, 9], "timestep": 11, "timo": [0, 5, 18], "ting": [0, 17], "tinghui": [], "tip": 24, "titl": 16, "tl89": [0, 13], "tn": 26, "to_html": 5, "todai": [7, 8, 24], "todd": 0, "todo": [], "togeth": 28, "token": [0, 3, 4, 6, 8, 9, 10, 11, 14, 19, 20, 28], "tokenization_utils_bas": [4, 24], "tokenizers_parallel": 10, "tom": 0, "tomer": 0, "tomlkit": [], "tommi": [], "too": [4, 19, 22, 24, 26], "tool": [4, 10, 13, 18, 25], "toolkit": [], "top": [8, 15, 19, 21, 26, 28], "top_k": 4, "top_p": 4, "topic": [3, 9, 13, 15, 19, 23], "topk": 24, "torch": [4, 10, 16, 24], "torch_stoi": [], "torchaudio": [4, 10, 24], "torchdiffeq": [], "torchlibrosa": [], "torchmetr": [], "torchsd": 10, "torchvis": 10, "tornado": [], "torr": [0, 17, 18, 27], "toshimitsu": 0, "total": 10, "total_loss": [4, 24], "toutanova": [0, 18, 28], "tov": [], "tovstogan": [0, 5, 23], "toward": [0, 5, 7, 8, 9, 11, 17, 18, 20, 23], "to\u00e2": [0, 8], "tp": 26, "tqdm": [4, 24], "tr": [], "tr_dataload": [4, 24], "trace": [3, 7, 18], "traceback": [4, 10, 16, 24], "track": [0, 5, 6, 8, 9, 17, 27, 28], "track2emb": [4, 24], "track_id": [4, 24], "tradeoff": 20, "tradit": [3, 13, 25, 27], "tradition": 9, "train": [0, 2, 3, 5, 6, 7, 8, 9, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 27], "train_dataset": [4, 24], "train_loss": [4, 24], "train_parma": [4, 24], "traitlet": [], "trajectori": [0, 23], "trampolin": [], "tran": 0, "transact": [0, 9, 17, 18, 27], "transcript": 15, "transfer": [0, 2, 13, 18, 21, 28], 
"transform": [0, 4, 5, 8, 9, 11, 13, 14, 16, 18, 19, 21, 22, 24, 28], "transit": 18, "translat": [0, 4, 6, 7, 8, 11, 17, 19], "transpar": [], "transport": 4, "treat": [3, 8, 11, 21, 23, 27], "tremend": 21, "trend": [8, 13], "tri": [], "trick": 2, "trigger": [], "triplet": [3, 28], "true": [4, 10, 16, 24, 26], "truli": [2, 4], "truncat": [4, 24], "trung": [], "trust": 24, "truth": [4, 6, 24, 26], "try": [4, 16, 22, 24], "tsa": [0, 9], "tsai": [], "tsne": 16, "tsung": [0, 8, 9], "ttm": [2, 11], "ttmr": 16, "tu": [], "tune": [4, 8, 19, 24], "tuoma": [], "tupl": 9, "turab": 0, "turbo": [], "turn": [9, 10, 11, 18, 21], "turnbul": [0, 9, 17, 18, 27], "tutoir": [], "tutori": [2, 4, 9, 11, 13, 15, 16, 17, 19, 21, 22, 24], "twelfth": [0, 8], "two": [0, 3, 7, 8, 12, 13, 14, 17, 21, 27, 28], "txt": 15, "ty": [0, 8], "type": [4, 5, 7, 8, 12, 13, 21], "typer": [], "typic": [6, 7, 8, 9, 12, 19, 23, 27, 28], "tzanetaki": [0, 9], "tzdata": [], "tzg": [0, 2], "u": [3, 4, 11, 19, 20, 22, 28], "uc": [], "ucsd": [], "udi": [0, 13], "udio": [0, 13], "uesaka": 0, "ugen": [], "uh": [], "ultim": 12, "umap": [], "umap_learn": [], "umbrella": [7, 8], "umg": 15, "un": 21, "unabl": 6, "unannot": 27, "unattribut": 2, "uncommon": 21, "uncondit": [0, 17], "under": [8, 15, 27], "underbrac": 22, "undergo": 8, "understand": [0, 4, 5, 6, 8, 9, 11, 12, 13, 15, 16, 17, 18, 22, 23, 25, 26, 28], "understnad": 24, "understood": 20, "unequivoc": 8, "unfamiliar": 28, "unfeas": 27, "unforgett": 21, "unfortun": 23, "uni": [], "uni01": [0, 12], "unifi": [0, 8, 16, 18], "unigram": 6, "union": 0, "uniqu": [3, 12, 20], "unit": [0, 6, 8, 9, 28], "univ": [], "univers": [0, 6, 8, 15, 17, 18, 28], "unknown": [21, 28], "unlabel": [7, 21], "unlik": [11, 12, 19, 21, 25, 28], "unlimit": 28, "unrel": 28, "unresolv": 15, "unrestrict": 27, "unrol": 8, "unsatisfactori": 23, "unseen": [17, 21], "unsqueez": 4, "unstabl": 2, "unsupervis": [0, 4, 18], "unterthin": 0, "until": [11, 22], "unus": 2, "up": [2, 6, 10, 11, 19, 20, 
21], "upbeat": [2, 5, 23, 24, 25, 27, 28], "updat": [0, 4, 19], "upend": 2, "uplift": 5, "upload": 10, "upon": 23, "uriel": 0, "url": [0, 5, 6, 8, 9], "urllib3": [], "urtasun": [], "us": [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28], "usa": 15, "usabl": 16, "usag": [17, 20, 26], "usai": [], "user": [0, 6, 8, 9, 10, 15, 16, 18, 19, 23, 24, 25, 26, 27, 28], "userwarn": [4, 10], "usic": [], "usr": [4, 24], "usual": [6, 7, 8, 9, 22, 27], "uszkoreit": 0, "utf": 5, "util": [4, 14, 24, 28], "utilis": 8, "uvicorn": [], "v": [0, 8, 9, 11, 14, 18], "v1": [0, 6, 8, 9], "v2": [], "v3": [], "v4": [4, 24], "v_diffusion_pytorch": [], "va": [], "vae": [0, 11, 19], "vahdat": [], "vajda": [], "valid": [4, 6, 24], "valid_loss": [4, 24], "vall": [0, 8], "valu": [6, 11, 12, 14, 19, 26, 27], "valuabl": [12, 23, 25], "vampnet": [0, 2, 13], "van": [0, 17, 28], "vandergheynst": [], "vari": [0, 2, 8, 11, 18], "variabl": [9, 10, 11], "varianc": 12, "variant": [6, 7, 9], "variat": [0, 6, 8, 9, 19, 23], "varieti": [7, 8, 16, 19, 21, 23], "variou": [5, 15, 16, 18, 19, 27, 28], "varun": [], "vascipy10contributors20": [], "vast": 28, "vastli": 21, "vasudevan": 0, "vaswani": 0, "vdodz": [0, 13, 17], "vdov": [], "ve": [2, 3, 4, 8, 24], "vector": [0, 2, 8, 11, 14, 18, 19, 24, 28], "vector_quantize_pytorch": [], "veit": [], "ventur": 2, "venv": [], "veri": [6, 10, 19, 21, 27], "vers": 2, "versatil": [], "version": [4, 11, 12, 28], "versu": 28, "verzetti": [0, 5, 18], "vesa": [], "vggish": 12, "via": [0, 5, 6, 8, 14], "vibe": [5, 24], "vicki": 0, "vicol": [], "video": [0, 5, 7], "videocrafter1": [], "view": [19, 27], "vijai": 0, "vincent": [0, 18], "vinh": [], "vinyal": [0, 17, 28], "violin": [4, 24], "virtanen": [], "virtual": [10, 19, 28], "visheratin": [], "visio": 0, "vision": [0, 5, 6, 16], "visit": [0, 5, 6, 8, 9], "visual": [0, 8, 16, 28], "vivek": [0, 6, 8], "vocabulari": [9, 16, 17, 18, 21, 28], "vocal": [2, 4, 5, 24, 28], "vocalist": [4, 24], 
"vocod": [0, 11], "voic": [2, 4, 24], "volkmann": [], "volum": [0, 2, 5, 6, 8, 9, 19], "voss": [], "voznesenski": [], "vq": 19, "vqgan": 19, "vri": [], "vsp": [0, 14], "vulner": 24, "w": [0, 8, 11, 16], "wa": [4, 13, 15, 17, 19, 21, 24, 28], "wade": [], "wai": [2, 3, 5, 6, 8, 11, 13, 15, 16, 18, 19, 21, 23, 24, 27], "wainwright": [0, 18], "wakaki": [0, 6, 8], "walk": [0, 11, 25], "wallac": [], "wandb": [], "wang": [0, 6, 8, 18], "wang_self": [], "wanmo": [], "want": [2, 4, 5, 10, 17, 19, 21, 22, 23, 24, 27], "warn": [4, 10, 16, 24], "watanab": [0, 18], "watson": [], "wattenhof": [0, 8], "wav": 5, "wav2vec2featureextractor": 4, "waveform": [11, 17, 28], "wavegan": 17, "wavenet": [0, 13, 17, 19], "wbz": [0, 18], "wcmb": [], "wcs21": [0, 9], "wcwidth": [], "wcy22": [], "wcz": [0, 12, 28], "wdwb23": [0, 2, 11, 18], "wdwb24": [], "we": [2, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 27, 28], "weak": [0, 12], "weaker": 16, "web": [0, 15, 19, 28], "webdataset": [], "webencod": [], "websocket": [], "weck": [0, 5, 6, 8, 23, 28], "weer": 0, "wei": [0, 6, 8, 9, 18, 28], "wei_finetuned_2021": [], "weigh": [8, 28], "weight": [4, 6, 8], "weihao": [], "weikang": [], "weili": [], "weiner": 11, "weird": 11, "weiss": [], "weituo": [0, 8, 9], "welcom": [15, 24], "well": [2, 4, 6, 8, 11, 12, 15, 16, 18, 19, 20, 22], "wen": [0, 5, 8, 9], "weng": [], "wenhao": [0, 5, 8, 9], "wenhu": [0, 5, 8, 9], "wenrui": [0, 6], "wenwu": 0, "wenyi": [0, 8], "wenyu": [0, 6], "were": [3, 6, 17, 19, 21, 22, 23, 27, 28], "werkzeug": [], "wgen23": [], "wget": [], "wgn23": [], "whang": [], "what": [8, 9, 11, 19, 21, 23, 27], "whb": [], "when": [3, 6, 7, 8, 9, 12, 14, 18, 19, 21, 22, 23, 26, 27, 28], "whenev": 17, "where": [2, 4, 8, 9, 10, 11, 12, 13, 15, 17, 19, 20, 21, 23, 24, 25, 26, 27, 28], "wherea": 22, "whether": [9, 17, 19, 26], "whi05": [0, 27], "which": [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 26, 28], "whiil": 2, "while": [2, 6, 8, 
11, 12, 15, 16, 19, 20, 21, 23, 25, 28], "whisper": [15, 19, 21], "whistl": 5, "whitman": [0, 17, 27], "whl": [], "who": [2, 4, 24, 28], "whole": [9, 19], "wht24": [], "why": [11, 15, 19, 24], "wichern": 0, "wide": [7, 8, 12, 16, 17, 18, 19, 21], "wider": [6, 11, 16], "widmer": [], "width": [4, 11], "wikimut": [0, 5, 28], "wikipedia": 5, "wilei": [], "william": [0, 18], "wimbauer": [], "winter": [], "wise": [2, 11], "wish": [3, 8, 11], "within": [2, 4, 8, 9, 16, 26, 28], "without": [2, 6, 8, 13, 16, 18, 19, 23, 28], "wizadwongsa": [], "wjt": [], "wkgs24": [0, 28], "wmb": [0, 6, 8], "wmd": [], "wnn": [0, 5, 8, 9], "wojciech": 0, "wolf": [0, 17], "womanli": 4, "women": 4, "won": [0, 5, 9, 10, 18, 23, 28], "wong": [], "wook": [0, 15, 18, 28], "word": [0, 2, 11, 16, 17, 18, 19, 21, 22, 27, 28], "work": [2, 3, 8, 9, 11, 13, 15, 19, 21, 22, 24, 25, 28], "workflow": [2, 15], "workshop": [0, 8], "world": [4, 9, 15, 17, 19, 23, 26, 27, 28], "worst": 19, "worth": 8, "would": [8, 9, 11, 23, 27, 28], "wrap": [2, 4, 10], "wrapper": 10, "wrapt": [], "wright": 0, "write": [2, 4, 16], "writeup": 11, "written": [], "wte": 4, "wu": [0, 5, 8, 9, 18, 23, 28], "ww": [], "www": 0, "wxfx24": [], "wy23": [], "wyi3wkzjy": [0, 8], "wzl": [0, 6], "wzz": [0, 13], "x": [8, 9, 11, 17, 24], "x_": 28, "x_0": [], "x_1": [], "x_2": [], "x_m": [], "x_t": [], "x_transform": [], "xattn": 8, "xavier": [0, 9, 28], "xi": 0, "xia": 0, "xiang": [], "xiangyu": [], "xianzhao": [0, 8], "xiao": [], "xiaob": [0, 9], "xiaodong": [], "xiaogang": [], "xiaohuan": [0, 6, 8], "xiaoliang": [], "xiaoyu": [], "xie": [0, 5, 8, 9, 28], "xin": [0, 8], "xinchao": [], "xing": [], "xingjian": [], "xinhao": 0, "xintao": [], "xinyi": [0, 8], "xinyin": [], "xipeng": [0, 8], "xiufeng": [], "xu": [0, 6, 8, 18], "xubo": 0, "xuchan": [], "xuchen": [0, 8, 9], "xudong": [], "xue": [], "xuefeng": [], "xuezhi": [0, 18], "xun": [], "xunlong": [0, 6], "xyzservic": [], "xzy": [], "y": [0, 8, 9, 17, 18, 24], "y_": [8, 9], "y_1": [8, 9], 
"y_2": [8, 9], "y_t": [8, 9], "yael": [], "yan": [0, 8], "yanbo": [], "yang": [0, 6, 8, 17, 18], "yaniv": [0, 17], "yanqi": [0, 18], "yanzuo": [], "yao": [], "yaofang": [], "yariv": [], "yarl": [], "yaron": [], "yatong": [], "yazh": [0, 28], "ycontributors21": [], "ycy17": [0, 13], "ye": [0, 8, 27], "year": [7, 8, 9, 11, 22], "yee": [], "yellow": 13, "yen": [], "yeong": [], "yeongmin": [], "yesil": 0, "yet": [4, 8, 17], "yeung": 0, "ygp": [], "ygz": [], "yi": [0, 17, 18], "yichong": [0, 6], "yichun": [], "yijin": [], "yilun": [], "yin": 0, "yinfei": [], "ying": [0, 5, 8], "yinghai": [], "yinghao": [0, 5, 8, 9], "yinhan": [0, 24, 28], "yinhuai": [], "yiqin": [], "yixiao": [0, 5, 23], "yiyi": 0, "yk": [], "yoav": [0, 8, 9], "yogesh": [], "yong": [], "yonghui": 0, "yoo": [], "yoon": [], "york": 15, "yoshua": [0, 17], "yossi": [0, 18], "you": [0, 3, 4, 5, 9, 10, 11, 15, 17, 19, 21, 24], "youngjung": [], "youngmoo": [0, 9], "your": [0, 3, 4, 6, 8, 10, 24], "your_hf_token": 10, "yourself": [10, 15], "youtub": [], "youtube8m": [5, 8], "yt": 5, "yt8m": [], "yu": [0, 8, 17, 18], "yuan": [0, 8, 28], "yuancheng": [], "yuanjun": [0, 6], "yuanzhen": [], "yuanzhong": [], "yuchen": [0, 28], "yudong": [0, 5, 8, 9], "yue": [0, 8, 9, 18, 28], "yueh": [], "yufeng": [], "yuhao": [], "yuhta": 0, "yujiu": [], "yukara": 0, "yuki": [0, 6, 8], "yukio": [], "yuliang": [], "yulun": [], "yume": [], "yun": [0, 9], "yunfei": [0, 6, 8], "yunfeng": [], "yunhua": [0, 8], "yunjei": [], "yunji": [], "yunxuan": [0, 18], "yupe": 0, "yuqe": [], "yusong": [0, 5, 18, 23, 28], "yutong": 0, "yuval": [], "yuxi": [], "yuxuan": 0, "ywv": [0, 16], "ywz": [], "yxk": [], "yxl": [0, 6], "z": 11, "z1": 24, "z2": 24, "z_audio": 24, "z_text": 24, "zach": 0, "zachari": [0, 3, 5, 8, 9, 15, 18], "zack": 0, "zackeri": 19, "zada": [], "zal": [0, 5, 18], "zaremba": 0, "zcc": [], "zcdb24": [0, 11], "zdy": [0, 8], "zeghidour": [], "zehua": 0, "zejun": [0, 8], "zen": [0, 17], "zeqian": [], "zero": [0, 11, 18, 28], 
"zero_grad": [4, 24], "zettlemoy": [], "zeyu": [], "zhan": [0, 8], "zhang": [0, 5, 6, 8, 9, 17, 18, 23, 25, 28], "zhang_bertscore_2020": [], "zhao": [0, 6, 8, 18], "zhaoyang": [], "zhen": [], "zheng": [], "zhengdong": [], "zhengyuan": [0, 6], "zhenhui": [], "zhf": [], "zhi": [0, 6, 8], "zhide": [], "zhifeng": [0, 8], "zhigeng": [0, 8], "zhihong": [], "zhije": [], "zhiji": [0, 8], "zhijun": [0, 8], "zhiqi": [], "zhishuai": [], "zhiyao": 0, "zhizheng": [], "zhong": [0, 6, 8], "zhongyi": [], "zhou": [0, 6, 8, 18, 28], "zhouhang": [0, 5, 8, 9], "zhouyu": [0, 17], "zhu": 0, "zhuang": [], "zhuohan": [0, 6], "zhuoyuan": [0, 6, 8], "zihao": [0, 5, 8, 9], "zijian": [], "zip": [], "ziqi": [], "zirui": 0, "ziv": 0, "ziwei": [], "zix": [0, 2], "zixun": [0, 5], "ziyu": [0, 6], "zizheng": [], "zlo": [], "zongyu": [], "zoph": [0, 18], "zornitsa": [0, 8, 9], "zou": [0, 6], "zra23": [], "zuchao": [], "zukowski": 0, "zuluaga": 0, "zwcd23": [0, 11], "zzm": [0, 6, 8], "\u00e0": 0, "\u00e1": [0, 5, 18], "\u00e4": 0, "\u00e4\u00e4": [], "\u00e7": 0, "\u00e9": [0, 18], "\u00eb": 0, "\u00ed": 0, "\u00f6": [0, 8, 9, 18, 28], "\u00fc": [], "\u0103": [], "\u02c6": []}, "titles": ["Bibliography", "Beyond Audio Modality", "Beyond Text-Based Interactions", "Conclusion", "Code Practice", "Datasets", "Evaluation", "Introduction", "Models", "Tasks", "Code Tutorial", "Diffusion Model-based Text-to-Music Generation", "Evaluation", "Introduction", "MusicGEN", "Connecting Music Audio and Natural Language", "Why Natural Langauge?", "Background", "Overview of Tutorial", "Advances", "Challenges", "The Framework", "Introduction", "Challenges", "Code Practice", "Conversational Retrieval", "Evaluation", "Introduction", "Models"], "titleterms": {"": [4, 24], "1": [4, 16, 24], "2": [4, 16, 24], "3": [4, 16, 24], "4": [4, 24], "A": [], "And": 7, "In": 19, "The": [7, 21], "about": 15, "abstract": 2, "adapt": [8, 21], "address": [], "advanc": 19, "aim": 15, "align": 19, "almost": 16, "anchor": 12, "annot": 17, 
"answer": 9, "appli": 28, "ar": [8, 22], "architectur": [8, 11, 24, 28], "attent": 21, "attribut": 28, "audio": [1, 10, 12, 14, 15, 28], "audio2audio": 2, "augment": [19, 28], "author": 15, "automat": [], "autoregress": 21, "ax": 7, "background": 17, "base": [2, 6, 11], "benchmark": 6, "benefit": [25, 28], "beyond": [1, 2, 28], "bibliographi": 0, "brief": [], "build": [4, 24], "call": 19, "caption": [9, 23], "chain": 19, "challeng": [20, 23, 25], "channel": 21, "class": [4, 24], "classif": 9, "code": [4, 10, 24], "codec": 14, "complex": [], "concaten": 21, "conclus": [3, 4, 24], "condit": [8, 11, 21], "connect": 15, "context": 19, "continu": 11, "control": 2, "convers": [9, 25], "creat": [4, 24], "cross": 21, "data": [4, 24, 28], "databas": [], "dataset": [4, 5, 24], "decod": [8, 19, 21], "definit": 13, "denot": [], "describ": [], "descript": [5, 7, 8, 9, 18], "dialogu": [], "diffus": 11, "direct": 25, "distanc": 12, "distil": 19, "distribut": 23, "divers": [12, 28], "do": 7, "don": [], "earli": [17, 27], "effici": 20, "embed": 28, "emploi": 28, "encod": [8, 16, 19, 21], "engin": 24, "environ": [4, 24], "evalu": [6, 12, 26], "exampl": [], "fad": 12, "fall": [], "feedback": 19, "fid": 12, "framework": 21, "friendli": 16, "from": 19, "fr\u00e9chet": 12, "function": [19, 28], "further": 24, "fusion": 8, "futur": 25, "gener": [11, 17, 18, 19], "get": [4, 15, 24], "handl": 28, "hidden": 12, "histori": 13, "human": [5, 16, 19], "i": [7, 16, 28], "implement": 21, "incept": 12, "infer": 24, "initi": 28, "input": 19, "instruct": [], "interact": 2, "interfac": 16, "introduct": [4, 7, 13, 22, 24, 27], "iter": 11, "joint": 28, "k": 21, "kei": 25, "label": 16, "langaug": [16, 18], "languag": [15, 19, 21, 22], "law": 19, "learn": [16, 19, 24, 28], "let": [4, 24], "leverag": 28, "limit": [6, 12, 19, 23, 25], "listen": 12, "ll": 24, "llm": 8, "load": 4, "loss": 28, "make": 24, "mask": 21, "match": 6, "mc": [], "mean": 12, "method": 27, "metric": [6, 28], "mismatch": 23, "mo": 12, 
"modal": [1, 28], "model": [4, 8, 11, 18, 19, 21, 22, 24, 28], "modul": 21, "motiv": 15, "mqa": [], "mtc": [], "multi": 28, "multimod": [8, 19], "multipl": 12, "mushra": 12, "music": [2, 5, 7, 8, 9, 11, 15, 17, 18], "musiccap": [], "musicgen": 14, "musictextclip": [], "nativ": 8, "natur": [15, 16], "need": 7, "neg": 28, "neural": 14, "normal": 21, "open": 10, "opinion": 12, "other": [], "our": [4, 24], "out": 27, "output": 19, "overview": [7, 18], "paradigm": [], "part": [], "perform": 20, "practic": [4, 24], "pre": 28, "precis": 26, "prefix": 21, "prerequisit": [4, 24], "problem": [13, 27], "qualiti": 12, "queri": [23, 28], "question": 9, "rag": 19, "reason": 19, "recal": 26, "refer": [5, 6, 8, 9, 12, 17, 18, 23, 25, 27, 28], "refin": 11, "relev": 12, "represent": [11, 16, 21], "resourc": [4, 24], "result": 4, "retriev": [17, 18, 19, 23, 24, 25, 27], "safeti": 20, "sampl": 28, "scalabl": 16, "scale": 19, "score": 12, "sdd": [], "section": 7, "semntica": 28, "sentenc": 28, "sequenc": 21, "set": [4, 24], "shot": 19, "similar": 28, "singl": [23, 25], "song": [], "sourc": 28, "stabl": 10, "stableaudio": [], "stage": [17, 27], "start": [4, 15, 24], "static": [], "step": [4, 24], "still": [], "stimuli": 12, "strateg": 28, "supervis": 16, "synthet": 5, "system": 25, "t": [], "tag": 28, "tak": [], "task": [9, 16, 19], "technic": 25, "techniqu": 28, "test": 12, "text": [2, 5, 11, 12, 21, 28], "thi": 7, "thought": 19, "through": 11, "tip": 28, "token": 21, "tool": 19, "toward": 28, "train": [4, 24, 28], "transfer": 19, "transform": [], "trust": 20, "tune": [], "turn": [23, 25], "tutoir": [], "tutori": [7, 10, 18], "type": 9, "umbrella": [], "under": [], "understand": 24, "univers": 16, "up": [4, 24], "us": 19, "vocabulari": 27, "we": [4, 7, 24], "weak": 16, "what": [4, 7, 22, 24, 28], "why": [7, 16], "written": 5, "y": 16, "youtube8m": [], "yt8m": [], "z": 16, "zero": 19}}) \ No newline at end of file