diff --git a/_images/encoder_decoder.png b/_images/encoder_decoder.png
new file mode 100644
index 0000000..163d337
Binary files /dev/null and b/_images/encoder_decoder.png differ
diff --git a/_images/muchomusic.png b/_images/muchomusic.png
new file mode 100644
index 0000000..ea107eb
Binary files /dev/null and b/_images/muchomusic.png differ
diff --git a/_sources/description/datasets.ipynb b/_sources/description/datasets.ipynb
index 7218604..23b2055 100644
--- a/_sources/description/datasets.ipynb
+++ b/_sources/description/datasets.ipynb
@@ -9,6 +9,8 @@
"\n",
"Below we review existing music datasets that contain natural language text accompanying music items. These can be music descriptions such as captions, or other types of text such as question-answer pairs and reviews. Note that in many cases the audio isn't directly distributed with the dataset and may be subject to copyright. \n",
"\n",
+ "```{table} Music description datasets.\n",
+ ":name: description_datasets\n",
"| Dataset | Content | Size (# annotations) | Accompanying Audio | Audio Length | Audio License| Text source | Dataset License\n",
"| ------- | ------ | ---- | ---- | ---- | ---- | ---- | ---- | \n",
"| [MusicCaps](https://www.kaggle.com/datasets/googleai/musiccaps) | Captions | 5.5k | ❌
(YT IDs from AudioSet)| 10s | - | Human-written (by musicians) | CC BY-SA 4.0 |\n",
@@ -22,7 +24,8 @@
"|[MUCaps](https://huggingface.co/datasets/M2UGen/MUCaps) | Captions | 22k | ❌ (YT IDs from AudioSet) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0|\n",
"|[MuEdit](https://huggingface.co/datasets/M2UGen/MUEdit) | Music editing instructions | 11k | ❌
(MusicCaps) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0|\n",
"|[FUTGA](https://huggingface.co/datasets/JoshuaW1997/FUTGA) | Captions (fine-grained) | 51.8k | ❌
(MusicCaps, Song Describer Dataset) | 2-5min | - | Synthetic (generated from audio via FUTGA) | Apache-2.0 |\n",
- "|[MARD](https://www.upf.edu/web/mtg/mard) | Album reviews | 264k| ❌ | - | - | Human-written (Amazon customers) | MIT |"
+ "|[MARD](https://www.upf.edu/web/mtg/mard) | Album reviews | 264k| ❌ | - | - | Human-written (Amazon customers) | MIT |\n",
+ "```"
]
},
{
diff --git a/_sources/description/evaluation.md b/_sources/description/evaluation.md
index 569bf8e..219caa7 100644
--- a/_sources/description/evaluation.md
+++ b/_sources/description/evaluation.md
@@ -1,6 +1,5 @@
(description_evaluation)=
# Evaluation
-
Reliably evaluating music description systems is a challenging endeavour. Even when we have "ground-truth" captions, it is not always clear how to score generated text, as music description is open-ended and at least partially subjective. The quality of a description is also strongly dependent on the context in which it is used. This issue becomes even more pronounced with dialogue-based tasks like MQA or other forms of instruction-based description.
Comparing outputs to gold-standard references from static datasets can help, but it's only a first step.
@@ -25,11 +24,20 @@ We briefly review each of these metrics below:
* **BERT-Score** also computes the similarity between tokens in a generated sentence and tokens in the ground-truth text, but does so using contextual embeddings obtained from a pre-trained BERT model instead of exact matches, resulting in a higher correlation with human judgements.
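+As an illustration, BERT-Score can be computed with the `bert-score` package. This is a minimal sketch with made-up example captions, not part of any specific benchmark protocol:
+```python
+# Minimal sketch: scoring a generated caption against a reference with BERT-Score.
+# Assumes the `bert-score` package is installed (pip install bert-score).
+from bert_score import score
+
+candidates = ["A slow acoustic guitar ballad with soft male vocals."]
+references = ["A mellow folk song featuring fingerpicked guitar and gentle singing."]
+
+# Returns precision, recall and F1, one value per candidate/reference pair.
+P, R, F1 = score(candidates, references, lang="en", verbose=False)
+print(f"BERT-Score F1: {F1.mean().item():.3f}")
+```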
-## Other types of automatic evaluation
-* Multiple-choice question answering: MuChoMusic {cite}`weck_muchomusic_2024`
-* Other benchmarks: OpenMU {cite}`zhao_openmu_2024`
-* LLM-as-a-judge
-* Non audio: {cite}`li_music_2024`
+### Limitations
+While a useful starting point for evaluating model outputs on more closed-ended tasks, these metrics cannot capture all admissible variations in music description. For example, given a music track, there may be several captions that are equally valid yet share very little syntactic or semantic similarity. Both in the music domain and in others such as general audio description, many studies have highlighted important limitations of these metrics, showing for example that they fail to account for valid variations in captions and correlate poorly with human judgement {cite}`lee2024captioningmetricsreflectmusic`. For this reason, including human evaluation and task-specific benchmarks is necessary for a more well-rounded assessment.
+
+## Benchmarks
+To overcome some of the shortcomings of match-based metrics, a few benchmarks have recently emerged with the goal of assessing music understanding or description via multiple-choice question-answering. These also better suit the conversational format of more recent music description systems, as they focus on assessing responses to specific user prompts (questions). Some benchmarks of this kind are designed for general audio-language evaluation and include music as part of a wider range of domains. Among these are AudioBench {cite}`wang2024audiobench` and AIR-Bench {cite}`yang-etal-2024-air`. Others, including [MuChoMusic](https://mulab-mir.github.io/muchomusic/) {cite}`weck_muchomusic_2024` and [OpenMU](https://mzhaojp22.github.io/open_music_understanding/) {cite}`zhao_openmu_2024`, directly focus on music:
+
+```{figure} ./img/muchomusic.png
+---
+name: muchomusic
+width: 400px
+align: center
+---
+The MuChoMusic benchmark evaluates music understanding in audio-language models via multiple-choice question answering {cite}`weck_muchomusic_2024`.
+```
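+Below is a hedged sketch of how such multiple-choice benchmarks are typically scored, assuming model answers are reduced to a single option letter (the answer-extraction rule is illustrative, not the exact protocol of MuChoMusic or OpenMU):
+```python
+# Illustrative sketch: scoring multiple-choice answers by exact letter match.
+# The answer-extraction rule is an assumption; real benchmarks define their own.
+import re
+
+def extract_choice(response: str):
+    """Return the first standalone option letter (A-D) found in a model response."""
+    match = re.search(r"\b([A-D])\b", response.strip().upper())
+    return match.group(1) if match else None
+
+predictions = ["The answer is B.", "C", "I think (A) fits best."]
+ground_truth = ["B", "D", "A"]
+
+correct = sum(extract_choice(p) == g for p, g in zip(predictions, ground_truth))
+print(f"Accuracy: {correct / len(ground_truth):.2f}")
+```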
## References
diff --git a/_sources/description/models.md b/_sources/description/models.md
index 5602016..e3ba031 100644
--- a/_sources/description/models.md
+++ b/_sources/description/models.md
@@ -3,26 +3,30 @@
Deep learning models for music description via natural language typically fit into one of two designs:
-- Encoder-decoder
-- Multimodal Autoregressive Models
+- [Encoder-decoder](encoder_decoder_models) models
+- [Multimodal AR](multimodal_ar) models, most often in the form of [adapted LLMs](adapted_llms)
-In {numref}`description_models_table` below we give an overview of music description models from 2016 to today. * denotes taks that don't fall under the music description umbrella but are still addressed by the model.
+In {numref}`description_models_table` we give an overview of music description models from 2016 to today. * denotes tasks that don't fall under the music description umbrella but are still addressed by the model.
```{table} Music description models.
:name: description_models_table
| Model | Type | Task(s) | Weights | Training dataset |
| ------- | ------ | ---- | ---- | ---- |
-| Choi *et al.* {cite}`manco2021muscaps` | Encoder-decoder | Playlist captioning | ❌ | Private |
-| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private |
-| LP-MusicCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning | ✅ | |
-| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private |
-| BLAP {cite}`lanzendorfer_blap_2024` | Encoder-decoder | | | |
-| MuLLama {cite}`liu_music_2024` | Adapted LLM | | | |
-| MusiLingo {cite}`deng_musilingo_2024` | Adapted LLM | | | |
-| M2UGen{cite}`hussain2023m` | Adapted LLM | | | |
-| LLark {cite}`gardner2023llark` | Adapted LLM | | | |
+| Choi *et al.* {cite}`manco2021muscaps` | Encoder-decoder | Captioning (playlist) | ❌ | Private data |
+| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private data |
+| PlayNTell {cite}`gabbolini-etal-2022-data` | Encoder-decoder | Captioning (playlist) | ✅ [link]() | PlayNTell |
+| LP-MusicCaps {cite}`doh2023lp` | Encoder-decoder | Captioning | ✅ [link](https://huggingface.co/seungheondoh/lp-music-caps) | LP-MusicCaps |
+| ALCAP {cite}`he2023alcap` | Encoder-decoder | Captioning | ❌ | Song Interpretation Dataset, NetEase Cloud Music Review Dataset |
+| BLAP {cite}`lanzendorfer_blap_2024` | Adapted LLM | Captioning | ✅ [link](https://huggingface.co/Tino3141/blap/tree/main) | Shutterstock (31k clips) |
+| LLark {cite}`gardner2023llark` | Adapted LLM | Captioning, MQA | ❌ | MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune|
+| MU-LLaMA {cite}`liu_music_2024` | Adapted LLM | Captioning, MQA | ✅ [link](https://huggingface.co/mu-llama/MU-LLaMA/tree/main) | MusicQA |
+| MusiLingo {cite}`deng_musilingo_2024` | Adapted LLM | Captioning, MQA | ✅ [link](https://github.com/zihaod/MusiLingo?tab=readme-ov-file#model-checkpoints) | MusicInstruct |
+| M2UGen {cite}`hussain2023m` | Adapted LLM | Captioning, MQA, music generation | ✅ [link](https://huggingface.co/M2UGen) | MUCaps, MUEdit |
+| OpenMU {cite}`zhao2024openmu` | Adapted LLM | Captioning, MQA | ✅ [link]() | MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune|
+| FUTGA {cite}`wu2024futga` | Adapted LLM | Captioning (fine-grained) | ✅ [link](https://huggingface.co/JoshuaW1997/FUTGA) | FUTGA|
```
+(encoder_decoder_models)=
## Encoder-Decoder Models
This is the modelling framework of the earliest DL music captioning models.
Encoder-decoder models first emerged in the context of sequence-to-sequence tasks such as machine translation. Since many other tasks can be cast as sequence-to-sequence problems, encoder-decoder models soon found wide use in image captioning, and shortly afterwards in audio captioning, including music.
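+To make the framework concrete, below is a toy encoder-decoder captioner in PyTorch. It is a sketch under simplified assumptions (a GRU audio encoder, a GRU decoder trained with teacher forcing, dummy log-mel features and token ids), not a reimplementation of any model in {numref}`description_models_table`:
+```python
+import torch
+import torch.nn as nn
+
+class AudioEncoder(nn.Module):
+    def __init__(self, n_mels=128, hidden=256):
+        super().__init__()
+        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
+
+    def forward(self, mel):                  # mel: (batch, frames, n_mels)
+        _, h = self.rnn(mel)                 # h: (1, batch, hidden)
+        return h.squeeze(0)                  # (batch, hidden) summary of the clip
+
+class CaptionDecoder(nn.Module):
+    def __init__(self, vocab_size=1000, hidden=256):
+        super().__init__()
+        self.embed = nn.Embedding(vocab_size, hidden)
+        self.cell = nn.GRUCell(hidden, hidden)
+        self.out = nn.Linear(hidden, vocab_size)
+
+    def forward(self, audio_state, tokens):  # teacher forcing over caption tokens
+        h, logits = audio_state, []
+        for t in range(tokens.shape[1]):
+            h = self.cell(self.embed(tokens[:, t]), h)
+            logits.append(self.out(h))
+        return torch.stack(logits, dim=1)    # (batch, seq_len, vocab_size)
+
+encoder, decoder = AudioEncoder(), CaptionDecoder()
+mel = torch.randn(2, 400, 128)               # dummy log-mel features for two clips
+tokens = torch.randint(0, 1000, (2, 12))     # dummy caption token ids
+logits = decoder(encoder(mel), tokens)
+print(logits.shape)                          # torch.Size([2, 12, 1000])
+```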
@@ -89,7 +93,7 @@ where $\boldsymbol{w}_{a t t}$ and $\boldsymbol{W}^{a t t}$ are learnable parame
Similar types of attention-based fusion can also be used in Transformer-based architectures {cite}`gabbolini-etal-2022-data` {cite}`doh2023lp`. In this setting, instead of the cross-attention shown above, fusion can also be directly embedded within the Transformer blocks by modifying their self-attention mechanism to depend on both text and audio embeddings, though exact implementations of co-attentional Transformer layers vary between models:
$$
-\boldsymbol{A}\left(\boldsymbol{q}^{\text{text}}_{i}, \boldsymbol{K}^{\text{audio}}, \boldsymbol{V}^{\text{audio}}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{q}^{\text{text}}_{i} K^{\text{audio}}}{\sqrt{d_{k}}}\right) \boldsymbol{V}^{\text{audio}}
+\boldsymbol{A}\left(\boldsymbol{q}^{\text{text}}_{i}, \boldsymbol{K}^{\text{audio}}, \boldsymbol{V}^{\text{audio}}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{q}^{\text{text}}_{i} \left(\boldsymbol{K}^{\text{audio}}\right)^{\top}}{\sqrt{d_{k}}}\right) \boldsymbol{V}^{\text{audio}}.
$$
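+As a sketch, the co-attention above can be written in a few lines of PyTorch, here applied to a whole batch of text queries at once (batch and dimension sizes are illustrative assumptions):
+```python
+# Sketch of cross-attention fusion: text queries attend over audio keys/values.
+import torch
+import torch.nn.functional as F
+
+def cross_attend(q_text, k_audio, v_audio):
+    # q_text: (batch, n_text, d_k); k_audio, v_audio: (batch, n_audio, d_k)
+    d_k = q_text.shape[-1]
+    scores = q_text @ k_audio.transpose(-2, -1) / d_k ** 0.5  # (batch, n_text, n_audio)
+    return F.softmax(scores, dim=-1) @ v_audio                # (batch, n_text, d_k)
+
+fused = cross_attend(torch.randn(1, 16, 64), torch.randn(1, 50, 64), torch.randn(1, 50, 64))
+print(fused.shape)  # torch.Size([1, 16, 64])
+```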
@@ -103,12 +107,13 @@ align: center
In addition to the type of mechanism used, depending on the level at which modalities are combined, it is also common to distinguish between *early* (i.e. at the input level), *intermediate* (at the level of latent representations produced by an intermediate step in the overall processing pipeline) or *late* fusion (i.e. at the output level). We note that the terms *early, intermediate* and *late* fusion do not have an unequivocal definition and are used slightly differently in different works.
+(multimodal_ar)=
+## Multimodal AR Models
+The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today's state-of-the-art models rely on LLMs in one form or another. Typically, this means that music description systems closely mimic text-only autoregressive modelling via Transformers, but within this framework there are two main routes we can take. The first, and most common, is to adapt text-only LLMs so that they become multimodal by augmenting them with additional modelling components. We call these *adapted LLMs*. A second option is to instead treat audio and text as sequences of tokens from the start, devising tokenization techniques and training on multiple modalities without additional modality-specific components. The line between these two approaches is not always clear. In the next section, we attempt to better define the salient characteristics of LLMs adapted to music-language inputs, and sketch out the newer trend towards natively multimodal models and its potential in music description.
-## Multimodal Autoregressive Models
-The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today's state-of-the-art models rely on LLMs in one form or another. Typically, this means that music description systems closely mimic text-only autoregressive modelling via Transformers, but within this framework there are two main routes we can take. The first, and most common, is to adapt text-only LLMs so that they become multimodal by augmenting them with additional modelling components. A second option is to instead treat audio and text as sequences of tokens from the start, devising tokenization techniques and training on multiple modalities without additional modality-specific components. The line between these two approaches is not always clear. In the next section, we attempt to better define the salient characteristics of LLMs adapted to music-language inputs, and sketch out the newer trend towards natively multimodal models and its potential in music description.
-
-Overall, a common thread in this line of work is the attempt to unify multimodal tasks by reframing all as text generation. When trained on music data, multimodal LLMs can therefore leverage their text-based interface to enable a variety of music understanding tasks by simply allowing users to query via text and obtain information about a given audio input. This is the machanism that enables the conversation-based music description tasks we have seen in the [Tasks](description_tasks) section.
+Overall, a common thread in this line of work is the attempt to unify multimodal tasks by reframing them all as text generation. When trained on music data, multimodal LLMs can therefore leverage their text-based interface to enable a variety of music understanding and description tasks by simply allowing users to query via text and obtain information about a given audio input. This is the mechanism that enables the conversation-based music description tasks we have seen in the [Tasks](description_tasks) section.
+(adapted_llms)=
### Adapted LLMs
One modelling paradigm that has become particularly popular in audio description, including music, is that of adapted (multimodal) LLMs. At the core of this approach is a pre-trained text-only LLM, which is adapted to take in inputs of different modalities
such as audio. This is achieved via an *adapter* module, a light-weight neural network trained to map embeddings produced by an audio feature extractor (usually pre-trained and then frozen) to the input space of the LLM. As a result of this adaptation process, the LLM can then receive audio embeddings alongside text embeddings.
@@ -121,31 +126,23 @@ align: center
---
```
-🚧
-
-Alongside music-specialised multimodal LLMs, a LLM with general-audio understanding capabilities can similarly perform music description tasks such as captioning and MQA. Among these we count:
-* SALMONN {cite}`tang_salmonn_2023`
-* Pengi {cite}`deshmukh_pengi_2023`
-* Qwen-Audio `chu_qwen-audio_2023`
-* LTU
-* [Audio-LLM: Activating the Capabilities of Large Language Models to Comprehend Audio Data](https://link.springer.com/chapter/10.1007/978-981-97-4399-5_13)
-
-We don't discuss these in detail, but their high-level design is similar to the music-specialised models we've seen in this section.
-
-#### Adapter Modules
+The adapter modules employed in adapted LLMs for music are typically lightweight MLPs (with 2 to 3 hidden layers) or Q-Formers. Other architectures used in general-audio adapted LLMs (or similar models in the visual domain) include more complex designs such as gated cross-attention (XATTN) dense layers. [This blog post](https://lilianweng.github.io/posts/2022-06-09-vlm/) about Visual Language Models reviews these in more detail.
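+As a minimal sketch, an MLP adapter of this kind might look as follows (dimensions, layer count and the concatenation with text embeddings are illustrative assumptions, not a specific model's implementation):
+```python
+# Hedged sketch of a lightweight MLP adapter: it maps frozen audio-encoder embeddings
+# into the LLM's input embedding space so they can be prepended to text embeddings.
+import torch
+import torch.nn as nn
+
+class AudioAdapter(nn.Module):
+    def __init__(self, audio_dim=768, llm_dim=4096, hidden_dim=2048):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(audio_dim, hidden_dim),
+            nn.GELU(),
+            nn.Linear(hidden_dim, llm_dim),
+        )
+
+    def forward(self, audio_embeddings):          # (batch, n_frames, audio_dim)
+        return self.net(audio_embeddings)         # (batch, n_frames, llm_dim)
+
+adapter = AudioAdapter()
+audio_tokens = adapter(torch.randn(1, 32, 768))   # pseudo "audio tokens" for the LLM
+text_tokens = torch.randn(1, 10, 4096)            # embedded text prompt (illustrative)
+llm_input = torch.cat([audio_tokens, text_tokens], dim=1)
+print(llm_input.shape)  # torch.Size([1, 42, 4096])
+```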
-#### Training
As in the text-only setting, training adapted LLMs is usually broken into several stages. After pre-training and finetuning of the text-only part, the remaining components undergo a series of multimodal training stages, while the backbone LLM is either kept frozen or further finetuned. These steps are usually a mixture of multi-task pre-training and supervised finetuning, often including instruction tuning, all carried out on pairs of audio and text data.
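+A sketch of this recipe, with stand-in modules to show which parameters are frozen and which are updated (all module definitions below are placeholders, not real model components):
+```python
+# Freeze the pre-trained audio encoder and backbone LLM; train only the adapter.
+import torch
+import torch.nn as nn
+
+audio_encoder = nn.Linear(128, 768)   # stand-in for a pre-trained audio encoder
+llm_backbone = nn.Linear(4096, 4096)  # stand-in for a pre-trained text-only LLM
+adapter = nn.Sequential(nn.Linear(768, 2048), nn.GELU(), nn.Linear(2048, 4096))
+
+for module in (audio_encoder, llm_backbone):
+    for p in module.parameters():
+        p.requires_grad = False        # keep pre-trained weights fixed
+
+optimizer = torch.optim.AdamW(
+    (p for p in adapter.parameters() if p.requires_grad), lr=1e-4
+)
+print(sum(p.numel() for p in adapter.parameters()))  # only these are trained
+```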
-##### Instruction Tuning
+Alongside music-specialised multimodal LLMs such as those in {numref}`description_models_table`, LLMs with general-audio understanding capabilities can similarly perform music description tasks such as captioning and MQA. Among these we count:
+* SALMONN {cite}`tang_salmonn_2023`
+* Pengi {cite}`deshmukh_pengi_2023`
+* Qwen-Audio {cite}`chu_qwen-audio_2023`
+* LTU {cite}`gong2023listen`
+* Audio Flamingo {cite}`kong2024audio_flamingo`
+* Audio-LLM {cite}`zhang2024audio_llm`
### Natively Multimodal AR Models
-Other autoregressive Transformer models for music description share a similar core modelling mechanism to adapted LLM. But one key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo this in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start.
+Adapted LLMs make it possible to turn text-only LLMs into multimodal models relatively efficiently: based on the models discussed in this section, around 20-150k audio-text paired samples are required for the adaptation stage of training, while multimodal pre-training would require orders of magnitude more data. However, this efficiency also limits their performance and often results in a bias towards the language modality and poor audio and music understanding capabilities {cite}`weck_muchomusic_2024`. An alternative that promises to overcome this limitation is to instead adopt a natively multimodal approach to AR modelling. One key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo these in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start.
This paradigm is sometimes referred to as mixed-modal early-fusion modelling.
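+A toy sketch of such a scheme, where audio codec tokens are shifted past the text vocabulary so that a single AR Transformer can model one interleaved sequence (vocabulary sizes and token ids are made-up assumptions, not any specific model's tokenizer):
+```python
+# Illustrative mixed-modal tokenization: one shared id space for text and audio tokens.
+TEXT_VOCAB_SIZE = 32_000
+AUDIO_CODEBOOK_SIZE = 1_024
+
+def audio_to_unified(audio_codes):
+    """Shift audio codec token ids into the shared vocabulary, after the text ids."""
+    return [TEXT_VOCAB_SIZE + c for c in audio_codes]
+
+text_ids = [101, 2054, 2003]                  # tokenized text prompt (dummy ids)
+audio_ids = audio_to_unified([12, 512, 887])  # tokenized audio clip (dummy codec ids)
+
+# One flat sequence; no modality-specific encoder is needed at the input.
+sequence = text_ids + audio_ids
+print(sequence)  # [101, 2054, 2003, 32012, 32512, 32887]
+```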
-It's worth noting that, at this time, this type of model is a promising direction for music description, rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modalling paradigm will be on music description in the years to come. Among current examples of this type of model that include music description we have:
-* AnyGPT {cite}`doh2023lp`
-*
+It's worth noting that, at this time, this type of model is a promising direction for music description rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models, such as AnyGPT {cite}`zhan-etal-2024-anygpt`, include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modelling paradigm will be on music description in the years to come.
## References
diff --git a/_sources/description/tasks.md b/_sources/description/tasks.md
index 1d5bd03..d06c7a5 100644
--- a/_sources/description/tasks.md
+++ b/_sources/description/tasks.md
@@ -87,7 +87,7 @@ align: center
A key difference between dialogue-based description and one-off captioning is that, instead of an `audio --> text` mapping, we are now dealing with an `(audio, text) --> text` mapping. This is reflected in the different model designs typically considered for these tasks (see [Models](description_models)). Differently from simple MQA, in music dialogue generation, responses are expected to be based on the entire dialogue history instead of only considering the current input.
-In terms of real-world applications, the advantages of dialogue-based description are clear: instead of being constrained to a one-shot caption or answer, it allows users to provide text inputs to further instruct the model on what kind of information should be included, or how the text output itself should be structured. In short, these tasks make for a much more flexible approach which better reflects real-world use. One drawback is that they are harder to evaluate (see [Evaluation](description-evaluation))!
+In terms of real-world applications, the advantages of dialogue-based description are clear: instead of being constrained to a one-shot caption or answer, it allows users to provide text inputs to further instruct the model on what kind of information should be included, or how the text output itself should be structured. In short, these tasks make for a much more flexible approach which better reflects real-world use. One drawback is that they are harder to evaluate (see [Evaluation](description_evaluation))!
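+As a sketch of this `(audio, text) --> text` interface, a dialogue wrapper might track the message history alongside the audio input (the class and its placeholder response are purely illustrative, not a specific model's API):
+```python
+# Hedged sketch: responses are conditioned on the audio plus the full dialogue history.
+from dataclasses import dataclass, field
+
+@dataclass
+class MusicDialogue:
+    audio_path: str
+    history: list = field(default_factory=list)
+
+    def ask(self, question: str) -> str:
+        self.history.append({"role": "user", "content": question})
+        # A real system would feed (audio embeddings + full history) to the model here.
+        answer = f"[response conditioned on {self.audio_path} and {len(self.history)} turns]"
+        self.history.append({"role": "assistant", "content": answer})
+        return answer
+
+dialogue = MusicDialogue("track.wav")
+dialogue.ask("What instruments can you hear?")
+dialogue.ask("And how would you describe the mood?")  # answered using both turns
+```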
## References
diff --git a/bibliography.html b/bibliography.html
index 5a68f36..b8cf727 100644
--- a/bibliography.html
+++ b/bibliography.html
@@ -487,6 +487,10 @@
Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. Convolutional recurrent neural networks for music classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume, 2392–2396. 2017. doi:10.1109/ICASSP.2017.7952585.
+Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models. December 2023. arXiv:2311.07919 [cs, eess]. URL: http://arxiv.org/abs/2311.07919 (visited on 2024-02-26), doi:10.48550/arXiv.2311.07919.
+Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, and others. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
@@ -611,6 +615,10 @@
Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. It's raw! audio generation with state-space models. In International Conference on Machine Learning, ICML, volume 162 of Proceedings of Machine Learning Research, 7616–7633. PMLR, 2022.
Yuan Gong, Hongyin Luo, Alexander H Liu, Leonid Karlinsky, and James Glass. Listen, think, and understand. arXiv preprint arXiv:2305.10790, 2023.
+Daniel W. Griffin and Jae S. Lim. Signal estimation from modified short-time fourier transform. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 804–807. IEEE, 1983.
@@ -667,6 +675,10 @@
Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley. Panns: large-scale pretrained audio neural networks for audio pattern recognition. IEEE ACM Trans. Audio Speech Lang. Process., 28:2880–2894, 2020.
Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: a novel audio language model with few-shot learning and dialogue abilities. In ICML. 2024. URL: https://openreview.net/forum?id=WYi3WKZjYe.
+Junghyun Koo, Gordon Wichern, Francois G Germain, Sameer Khurana, and Jonathan Le Roux. Smitin: self-monitored inference-time intervention for generative music transformers. arXiv preprint arXiv:2404.02252, 2024.
@@ -691,6 +703,10 @@
Jin Ha Lee. Analysis of user needs and information features in natural language queries seeking music information. Journal of the American Society for Information Science and Technology, 61(5):1025–1045, 2010.
Jinwoo Lee and Kyogu Lee. Do captioning metrics reflect music semantic alignment? In International Society for Music Information Retrieval (ISMIR) 2024, Late Breaking Demo (LBD). 2024. URL: https://arxiv.org/abs/2411.11692.
+Jongpil Lee and Juhan Nam. Multi-level and multi-scale feature aggregation using pretrained convolutional neural networks for music auto-tagging. IEEE Signal Processing Letters, 24(8):1208–1212, 2017. doi:10.1109/LSP.2017.2713830.
@@ -703,9 +719,9 @@
Mark Levy, Bruno Di Giorgi, Floris Weers, Angelos Katharopoulos, and Tom Nickson. Controllable music production with diffusion models and guidance gradients. In Diffusion Models Workshop at NeurIPS. 2023.
Jiajia Li, Lu Yang, Mingni Tang, Cong Chen, Zuchao Li, Ping Wang, and Hai Zhao. The music maestro or the musically challenged, a massive music evaluation benchmark for large language models. arXiv preprint arXiv:2406.15885, 2024.
+Dongting Li, Chenchong Tang, and Han Liu. Audio-LLM: activating the capabilities of large language models to comprehend audio data. In Xinyi Le and Zhijun Zhang, editors, Advances in Neural Networks – ISNN 2024, 133–142. Singapore, 2024. Springer Nature Singapore.
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems (NeurIPS). 2017.
Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F Chen. Audiobench: a universal benchmark for audio large language models. arXiv preprint arXiv:2406.16020, 2024.
+Ziyu Wang, Yiyi Zhang, Yixiao Zhang, Junyan Jiang, Ruihan Yang, Gus Xia, and Junbo Zhao. PIANOTREE VAE: structured representation learning for polyphonic music. In Proceedings of the 21th International Society for Music Information Retrieval Conference, ISMIR, 368–375. 2020.
@@ -943,16 +963,28 @@
Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. Midinet: A convolutional generative adversarial network for symbolic-domain music generation. In Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR, 324–331. 2017.
Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, and Jingren Zhou. AIR-bench: benchmarking large audio-language models via generative comprehension. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1979–1998. Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL: https://aclanthology.org/2024.acl-long.109, doi:10.18653/v1/2024.acl-long.109.
+Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu-Gang Jiang, and Xipeng Qiu. AnyGPT: unified multimodal LLM with discrete sequence modeling. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 9637–9662. Bangkok, Thailand, August 2024. Association for Computational Linguistics. URL: https://aclanthology.org/2024.acl-long.521, doi:10.18653/v1/2024.acl-long.521.
+Yixiao Zhang, Yukara Ikemiya, Gus Xia, Naoki Murata, Marco A Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji, and Simon Dixon. Musicmagus: zero-shot text-to-music editing via diffusion models. arXiv preprint arXiv:2402.06178, 2024.
Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, and Yuki Mitsufuji. Openmu: your swiss army knife for music understanding. arXiv preprint arXiv:2410.15573, 2024.
+Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, and Yuki Mitsufuji. OpenMU: Your Swiss Army Knife for Music Understanding. October 2024. arXiv:2410.15573. URL: http://arxiv.org/abs/2410.15573 (visited on 2024-11-09).