Update documentation
ilaria-manco committed Dec 17, 2024
1 parent a06bd56 commit 3f374c5
Showing 13 changed files with 272 additions and 188 deletions.
Binary file added _images/encoder_decoder.png
Binary file added _images/muchomusic.png
5 changes: 4 additions & 1 deletion _sources/description/datasets.ipynb
@@ -9,6 +9,8 @@
"\n",
"Below we review existing music datasets that contain natural language text accompanying music items. These can be music descriptions such as captions, or other types of text such as question-answer pairs and reviews. Note that in many cases the audio isn't directly distributed with the dataset and may be subject to copyright. \n",
"\n",
"```{table} Music description datasets.\n",
":name: description_datasets\n",
"| Dataset | Content | Size (# annotations) | Accompanying Audio | Audio Length | Audio License| Text source | Dataset License\n",
"| ------- | ------ | ---- | ---- | ---- | ---- | ---- | ---- | \n",
"| [MusicCaps](https://www.kaggle.com/datasets/googleai/musiccaps) | Captions | 5.5k | ❌ <br> (YT IDs from AudioSet)| 10s | - | Human-written (by musicians) | CC BY-SA 4.0 |\n",
@@ -22,7 +24,8 @@
"|[MUCaps](https://huggingface.co/datasets/M2UGen/MUCaps) | Captions | 22k | ❌ (YT IDs from AudioSet) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0|\n",
"|[MuEdit](https://huggingface.co/datasets/M2UGen/MUEdit) | Music editing instructions | 11k | ❌ <br> (MusicCaps) | 10s | - | Synthetic (generated from audio via MU-LLaMA) | CC BY-NC-ND 4.0|\n",
"|[FUTGA](https://huggingface.co/datasets/JoshuaW1997/FUTGA) | Captions (fine-grained) | 51.8k | ❌ <br> (MusicCaps, Song Describer Dataset) | 2-5min | - | Synthetic (generated from audio via FUTGA) | Apache-2.0 |\n",
"|[MARD](https://www.upf.edu/web/mtg/mard) | Album reviews | 264k| ❌ | - | - | Human-written (Amazon customers) | MIT |"
"|[MARD](https://www.upf.edu/web/mtg/mard) | Album reviews | 264k| ❌ | - | - | Human-written (Amazon customers) | MIT |\n",
"```"
]
},
{
20 changes: 14 additions & 6 deletions _sources/description/evaluation.md
@@ -1,6 +1,5 @@
(description_evaluation)=
# Evaluation

Reliably evaluating music description systems is a challenging endeavour. Even when we have "ground-truth" captions, it is not always clear how to score generated text, as music description is open-ended and at least partially subjective. The quality of a description is also strongly dependent on the context in which it is used. This issue becomes even more pronounced with dialogue-based tasks like MQA or other forms of instruction-based description.
Comparing outputs to a gold standard from static datasets can help, but it's only a first step.

@@ -25,11 +24,20 @@ We briefly review each of these metrics below:

* **BERT-Score** also computes the similarity between tokens in a generated sentence and tokens in the ground-truth text, but does so using contextual embeddings obtained from a pre-trained BERT model instead of exact matches, resulting in a higher correlation with human judgements.

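As a concrete (purely illustrative) example, the snippet below shows how BERT-Score can be computed with the open-source `bert-score` package; the candidate and reference captions are made up for the sake of the example.

```python
# Minimal sketch: scoring a generated caption against a reference with BERT-Score.
# Requires `pip install bert-score`; the captions below are illustrative only.
from bert_score import score

candidates = ["A slow acoustic guitar ballad with soft male vocals."]
references = ["A gentle folk song featuring fingerpicked guitar and a calm male voice."]

# P, R and F1 are tensors with one entry per candidate-reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERT-Score F1: {F1.mean().item():.3f}")
```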
## Other types of automatic evaluation
* Multiple-choice question answering: MuChoMusic {cite}`weck_muchomusic_2024`
* Other benchmarks: OpenMU {cite}`zhao_openmu_2024`
* LLM-as-a-judge
* Non audio: {cite}`li_music_2024`
### Limitations
While a useful starting point for evaluating model outputs on more closed-ended tasks, these metrics are unable to capture all admissible variations in music description. For example, given a music track, there may be several possible captions that are equally valid but share very little in terms of syntactic or semantic similarity. Both in the music domain and in others such as general audio description, many studies have highlighted important limitations of these metrics, for example showing that they fail to account for valid variations in captions and to align with human judgement {cite}`lee2024captioningmetricsreflectmusic`. For this reason, including human evaluation and task-specific benchmarks is necessary for a more well-rounded evaluation.

## Benchmarks
To overcome some of the shortcomings of match-based metrics, a few benchmarks have recently emerged with the goal of assessing music understanding or description via multiple-choice question-answering. These also better suit the conversational format of more recent music description systems, as they focus on assessing responses to specific user prompts (questions). Some benchmarks of this kind are designed for general audio-language evaluation and include music as part of a wider range of domains. Among these are AudioBench {cite}`wang2024audiobench` and AIR-Bench {cite}`yang-etal-2024-air`. Others, including [MuChoMusic](https://mulab-mir.github.io/muchomusic/) {cite}`weck_muchomusic_2024` and [OpenMU](https://mzhaojp22.github.io/open_music_understanding/) {cite}`zhao_openmu_2024`, directly focus on music:

```{figure} ./img/muchomusic.png
---
name: muchomusic
width: 400px
align: center
---
```
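As a rough illustration of how multiple-choice benchmarks of this kind are typically scored (this is not the official evaluation code of MuChoMusic or OpenMU), one can parse the option letter from the model's free-form response and compute accuracy over the benchmark items:

```python
# Illustrative multiple-choice QA scoring: parse the model's letter choice and
# compute accuracy. The benchmark item below is hypothetical.
import re

def parse_choice(response: str):
    """Extract the first standalone option letter (A-D) from a model response."""
    match = re.search(r"\b([ABCD])\b", response.upper())
    return match.group(1) if match else None

def accuracy(items, responses):
    """items: [{"question": ..., "options": [...], "answer": "C"}, ...]"""
    correct = sum(parse_choice(r) == item["answer"] for item, r in zip(items, responses))
    return correct / len(items)

items = [{"question": "Which instrument carries the main melody?",
          "options": ["A. Violin", "B. Trumpet", "C. Piano", "D. Flute"],
          "answer": "C"}]
print(accuracy(items, ["The melody is played on the piano, so C."]))  # 1.0
```

In practice, benchmark implementations also need to handle responses from which no option letter can be parsed.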

## References

67 changes: 32 additions & 35 deletions _sources/description/models.md
@@ -3,26 +3,30 @@

Deep learning models for music description via natural language typically fit into one of two designs:

- Encoder-decoder
- Multimodal Autoregressive Models
- [Encoder-decoder](encoder_decoder_models) models
- [Multimodal AR](multimodal_ar) models, most often in the form of [adapted LLMs](adapted_llms)

In {numref}`description_models_table` below we give an overview of music description models from 2016 to today. * denotes tasks that don't fall under the music description umbrella but are still addressed by the model.
In {numref}`description_models_table` we give an overview of music description models from 2016 to today. * denotes tasks that don't fall under the music description umbrella but are still addressed by the model.

```{table} Music description models.
:name: description_models_table
| Model | Type | Task(s) | Weights | Training dataset |
| ------- | ------ | ---- | ---- | ---- |
| Choi *et al.* {cite}`manco2021muscaps` | Encoder-decoder | Playlist captioning | ❌ | Private |
| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private |
| LP-MusicCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning | ✅ | |
| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private |
| BLAP {cite}`lanzendorfer_blap_2024` | Encoder-decoder | | | |
| MuLLama {cite}`liu_music_2024` | Adapted LLM | | | |
| MusiLingo {cite}`deng_musilingo_2024` | Adapted LLM | | | |
| M2UGen{cite}`hussain2023m` | Adapted LLM | | | |
| LLark {cite}`gardner2023llark` | Adapted LLM | | | |
| Choi *et al.* {cite}`manco2021muscaps` | Encoder-decoder | Captioning (playlist) | ❌ | Private data |
| MusCaps {cite}`manco2021muscaps` | Encoder-decoder | Captioning, retrieval* | ❌ | Private data |
| PlayNTell {cite}`gabbolini-etal-2022-data` | Encoder-decoder | Captioning (playlist) | ✅ [link]() | PlayNTell |
| LP-MusicCaps {cite}`doh2023lp` | Encoder-decoder | Captioning | ✅ [link](https://huggingface.co/seungheondoh/lp-music-caps) | LP-MusicCaps |
| ALCAP {cite}`he2023alcap` | Encoder-decoder | Captioning | ❌ | Song Interpretation Dataset, NetEase Cloud Music Review Dataset |
| BLAP {cite}`lanzendorfer_blap_2024` | Adapted LLM | Captioning | ✅ [link](https://huggingface.co/Tino3141/blap/tree/main) | Shutterstock (31k clips) |
| LLark {cite}`gardner2023llark` | Adapted LLM | Captioning, MQA | ❌ | MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune|
| MU-LLaMA {cite}`liu_music_2024` | Adapted LLM | Captioning, MQA | ✅ [link](https://huggingface.co/mu-llama/MU-LLaMA/tree/main) | MusicQA |
| MusiLingo {cite}`deng_musilingo_2024` | Adapted LLM | Captioning, MQA | ✅ [link](https://github.com/zihaod/MusiLingo?tab=readme-ov-file#model-checkpoints) | MusicInstruct |
| M2UGen {cite}`hussain2023m` | Adapted LLM | Captioning, MQA, music generation | ✅ [link](https://huggingface.co/M2UGen) | MUCaps, MUEdit |
| OpenMU {cite}`zhao2024openmu` | Adapted LLM | Captioning, MQA | ✅ [link]() | MusicCaps, YouTube8M-MusicTextClips, MusicNet, FMA, MTG-Jamendo, MagnaTagATune|
| FUTGA {cite}`wu2024futga` | Adapted LLM | Captioning (fine-grained) | ✅ [link](https://huggingface.co/JoshuaW1997/FUTGA) | FUTGA|
```

(encoder_decoder_models)=
## Encoder-Decoder Models
This is the modelling framework of the earliest DL music captioning models.
Encoder-decoder models first emerged in the context of sequence-to-sequence tasks (e.g. machine translation). It is easy to see that many other tasks can be cast as sequence-to-sequence, so encoder-decoder models found wide use in image captioning first, and in audio captioning, including music, shortly after.
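Before looking at the individual components, the following PyTorch sketch illustrates the overall pattern at a high level. It is schematic and not a reproduction of any specific model from {numref}`description_models_table`: an audio encoder turns a spectrogram into a sequence of embeddings, and an autoregressive text decoder cross-attends to them to predict caption tokens.

```python
# Schematic encoder-decoder captioning sketch (illustrative, not any published model).
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    def __init__(self, n_mels=128, d_model=512):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, mel):                    # mel: (batch, n_mels, time)
        x = self.conv(mel).transpose(1, 2)     # (batch, time', d_model)
        out, _ = self.rnn(x)
        return out                             # audio embeddings used to condition the decoder

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=10_000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, audio_states):   # tokens: (batch, seq_len)
        x = self.embed(tokens)
        seq_len = tokens.size(1)
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        x = self.decoder(x, audio_states, tgt_mask=causal_mask)  # cross-attends to audio
        return self.lm_head(x)                 # next-token logits

mel = torch.randn(2, 128, 1000)                # a batch of log-mel spectrograms
tokens = torch.randint(0, 10_000, (2, 12))     # partial captions (token IDs)
logits = CaptionDecoder()(tokens, AudioEncoder()(mel))
print(logits.shape)                            # torch.Size([2, 12, 10000])
```

At inference time, the decoder is run autoregressively: starting from a start-of-sequence token, the next token is sampled (or taken greedily) from the logits and appended to the input until an end-of-sequence token is produced.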
@@ -89,7 +93,7 @@ where $\boldsymbol{w}_{a t t}$ and $\boldsymbol{W}^{a t t}$ are learnable parameters
Similar types of attention-based fusion can also be used in Transformer-based architectures {cite}`gabbolini-etal-2022-data` {cite}`doh2023lp`. In this setting, instead of the cross-attention shown above, fusion can also be directly embedded within the Transformer blocks by modifying their self-attention mechanism to depend on both text and audio embeddings, though exact implementations of co-attentional Transformer layers vary between models:

$$
\boldsymbol{A}\left(\boldsymbol{q}^{\text{text}}_{i}, \boldsymbol{K}^{\text{audio}}, \boldsymbol{V}^{\text{audio}}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{q}^{\text{text}}_{i} K^{\text{audio}}}{\sqrt{d_{k}}}\right) \boldsymbol{V}^{\text{audio}}
\boldsymbol{A}\left(\boldsymbol{q}^{\text{text}}_{i}, \boldsymbol{K}^{\text{audio}}, \boldsymbol{V}^{\text{audio}}\right)=\operatorname{softmax}\left(\frac{\boldsymbol{q}^{\text{text}}_{i} \boldsymbol{K}^{\text{audio}\top}}{\sqrt{d_{k}}}\right) \boldsymbol{V}^{\text{audio}}.
$$

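The snippet below is a minimal single-head sketch of this cross-attention fusion, with text-derived queries attending over audio-derived keys and values (learned projections and multiple heads are omitted for brevity):

```python
# Single-head cross-attention fusion sketch: text queries attend to audio keys/values.
import math
import torch

def cross_attend(q_text, k_audio, v_audio):
    """q_text: (batch, n_text, d_k); k_audio, v_audio: (batch, n_audio, d_k)."""
    scores = q_text @ k_audio.transpose(1, 2) / math.sqrt(q_text.size(-1))
    weights = torch.softmax(scores, dim=-1)        # (batch, n_text, n_audio)
    return weights @ v_audio                       # audio-informed text representations

fused = cross_attend(torch.randn(1, 16, 64), torch.randn(1, 100, 64), torch.randn(1, 100, 64))
print(fused.shape)  # torch.Size([1, 16, 64])
```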

@@ -103,12 +107,13 @@ align: center

In addition to the type of mechanism used, depending on the level at which modalities are combined, it is also common to distinguish between *early* (i.e. at the input level), *intermediate* (at the level of latent representations produced by an intermediate step in the overall processing pipeline) or *late* fusion (i.e. at the output level). We note that the terms *early, intermediate* and *late* fusion do not have an unequivocal definition and are used slightly differently in different works.

(multimodal_ar)=
## Multimodal AR Models
The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today's state-of-the-art models rely on LLMs in one form or another. Typically, this means that music description systems closely mimic text-only autoregressive modelling via Transformers, but within this framework there are two main routes we can take. The first, and most common, is to adapt text-only LLMs so that they become multimodal by augmenting them with additional modelling components. We call these *adapted LLMs*. A second option is to instead treat audio and text as sequences of tokens from the start, devising tokenization techniques and training on multiple modalities without additional modality-specific components. The line between these two approaches is not always clear. In the next section, we attempt to better define the salient characteristics of LLMs adapted to music-language inputs, and sketch out the newer trend towards natively multimodal models and its potential in music description.

## Multimodal Autoregressive Models
The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today's state-of-the-art models rely on LLMs in one form or another. Typically, this means that music description systems closely mimic text-only autoregressive modelling via Transformers, but within this framework there are two main routes we can take. The first, and most common, is to adapt text-only LLMs so that they become multimodal by augmenting them with additional modelling components. A second option is to instead treat audio and text as sequences of tokens from the start, devising tokenization techniques and training on multiple modalities without additional modality-specific components. The line between these two approaches is not always clear. In the next section, we attempt to better define the salient characteristics of LLMs adapted to music-language inputs, and sketch out the newer trend towards natively multimodal models and its potential in music description.

Overall, a common thread in this line of work is the attempt to unify multimodal tasks by reframing all as text generation. When trained on music data, multimodal LLMs can therefore leverage their text-based interface to enable a variety of music understanding tasks by simply allowing users to query via text and obtain information about a given audio input. This is the machanism that enables the conversation-based music description tasks we have seen in the [Tasks](description_tasks) section.
Overall, a common thread in this line of work is the attempt to unify multimodal tasks by reframing all as text generation. When trained on music data, multimodal LLMs can therefore leverage their text-based interface to enable a variety of music understanding and description tasks by simply allowing users to query via text and obtain information about a given audio input. This is the mechanism that enables the conversation-based music description tasks we have seen in the [Tasks](description_tasks) section.

(adapted_llms)=
### Adapted LLMs
One modelling paradigm that has become particularly popular in audio description, including music, is that of adapted (multimodal) LLMs. At the core of this approach is a pre-trained text-only LLM, which is adapted to take in inputs of different modalities
such as audio. This is achieved via an *adapter* module, a light-weight neural network trained to map embeddings produced by an audio feature extractor (usually pre-trained and then frozen) to the input space of the LLM. As a result of this adaptation process, the LLM can then receive audio embeddings alongside text embeddings.
@@ -121,31 +126,23 @@ align: center
---
```

🚧

Alongside music-specialised multimodal LLMs, a LLM with general-audio understanding capabilities can similarly perform music description tasks such as captioning and MQA. Among these we count:
* SALMONN {cite}`tang_salmonn_2023`
* Pengi {cite}`deshmukh_pengi_2023`
* Qwen-Audio `chu_qwen-audio_2023`
* LTU
* [Audio-LLM: Activating the Capabilities of Large Language Models to Comprehend Audio Data](https://link.springer.com/chapter/10.1007/978-981-97-4399-5_13)

We don't discuss these in detail, but their high-level design is similar to the music-specialised models we've seen in this section.

#### Adapter Modules
The architecture of the adapter modules employed in adapted LLMs for music typically consists of lightweight MLPs (between 2 and 3 hidden layers) or Q-Formers. Other architectures utilised in general audio adapted LLMs (or similar models in the visual domain) also include more complex designs such as Gated XATTN dense layers. [This blog post](https://lilianweng.github.io/posts/2022-06-09-vlm/) about Visual Language Models reviews these in more detail.
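As an illustration, a 2-layer MLP adapter of this kind can be as simple as the sketch below (the dimensions are placeholders and will differ between models):

```python
# Minimal MLP adapter sketch: maps frozen audio-encoder embeddings into the
# LLM's input embedding space. All dimensions here are illustrative.
import torch.nn as nn

class MLPAdapter(nn.Module):
    def __init__(self, audio_dim=768, llm_dim=4096, hidden_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),    # outputs live in the LLM token-embedding space
        )

    def forward(self, audio_embeddings):       # (batch, n_frames, audio_dim)
        return self.net(audio_embeddings)      # (batch, n_frames, llm_dim)
```

The resulting "audio tokens" are then prepended to (or interleaved with) the text token embeddings before being passed to the LLM.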

#### Training
From the perspective of training, similarly to the text-only setting, training adapted LLMs is usually broken into several stages. After pre-training and finetuning of the text-only part, the remaining components undergo a series of multimodal training stages, while the backbone LLM is either kept frozen or further finetuned. These steps are usually a mixture of multi-task pre-training and supervised finetuning, often including instruction tuning, all carried out on pairs of audio and text data.
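As a hypothetical illustration, a single supervised finetuning (instruction-tuning) sample might look like the following; the field names are made up and not taken from any specific dataset:

```python
# Hypothetical audio-text instruction-tuning sample; field names are illustrative.
sample = {
    "audio": "clip_00123.wav",   # paired audio, encoded and adapted into LLM input embeddings
    "instruction": "Describe the mood and instrumentation of this track.",
    "response": "A mellow lo-fi beat with warm electric piano chords and soft vinyl crackle.",
}
# During supervised finetuning, the loss is typically computed only on the
# response tokens, with the audio embeddings and instruction acting as context.
```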

##### Instruction Tuning
Alongside music-specialised multimodal LLMs such as those in {numref}`description_models_table`, LLMs with general-audio understanding capabilities can similarly perform music description tasks such as captioning and MQA. Among these we count:
* SALMONN {cite}`tang_salmonn_2023`
* Pengi {cite}`deshmukh_pengi_2023`
* Qwen-Audio {cite}`chu_qwen-audio_2023`
* LTU {cite}`gong2023listen`
* Audio Flamingo {cite}`kong2024audio_flamingo`
* Audio-LLM {cite}`zhang2024audio_llm`

### Natively Multimodal AR Models
Other autoregressive Transformer models for music description share a similar core modelling mechanism to adapted LLM. But one key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo this in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start.
Adapted LLMs make it possible to transform text-only LLMs into multimodal models relatively efficiently: based on the models discussed in this section, around 20-150k audio-text paired samples are required to perform the adaptation stage of training, while multimodal pre-training would require orders of magnitude more data. However, this also limits their performance and often results in a bias towards the language modality and poor audio and music understanding capabilities {cite}`weck_muchomusic_2024`. An alternative that promises to overcome this limitation is to instead adopt a natively multimodal approach to AR modelling. One key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo this in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start.
This paradigm is sometimes referred to as mixed-modal early-fusion modelling.

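As a rough sketch of what this looks like in practice, discrete audio tokens (e.g. produced by a neural audio codec) can simply be offset past the text vocabulary so that both modalities share a single token sequence (the vocabulary sizes below are illustrative):

```python
# Mixed-modal early-fusion sketch: text and audio share one token sequence.
TEXT_VOCAB_SIZE = 32_000  # illustrative text vocabulary size

def build_sequence(text_ids, audio_codes):
    """Concatenate text tokens and audio codec tokens into one AR training sequence."""
    audio_ids = [TEXT_VOCAB_SIZE + c for c in audio_codes]  # shift codes past the text vocab
    return text_ids + audio_ids

print(build_sequence([5, 821, 94], [17, 3, 3999]))  # [5, 821, 94, 32017, 32003, 35999]
```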
It's worth noting that, at this time, this type of model is a promising direction for music description, rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modalling paradigm will be on music description in the years to come. Among current examples of this type of model that include music description we have:
* AnyGPT {cite}`doh2023lp`
*
It's worth noting that, at this time, this type of model is a promising direction for music description, rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models, such as AnyGPT {cite}`zhan-etal-2024-anygpt`, include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modelling paradigm will be on music description in the years to come.

## References

2 changes: 1 addition & 1 deletion _sources/description/tasks.md
@@ -87,7 +87,7 @@

A key difference between dialogue-based description and one-off captioning is that, instead of an `audio --> text` mapping, we are now dealing with an `(audio, text) --> text` mapping. This is reflected in the different model designs typically considered for these tasks (see [Models](description_models)). Differently from simple MQA, in music dialogue generation, responses are expected to be based on the entire dialogue history instead of only considering the current input.

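As a schematic illustration of this `(audio, text) --> text` setting, a dialogue-based system conditions each response on the audio and on the full conversation history, for example by assembling a prompt along these lines (the format below is hypothetical):

```python
# Hypothetical prompt assembly for dialogue-based music description: the response
# is conditioned on the audio and the full dialogue history, not just the last turn.
def build_prompt(history, user_message):
    """history: list of (role, text) tuples for the previous turns."""
    turns = [f"{role}: {text}" for role, text in history]
    turns.append(f"user: {user_message}")
    turns.append("assistant:")
    return "<audio>\n" + "\n".join(turns)   # audio embeddings replace the <audio> placeholder

print(build_prompt([("user", "What genre is this?"),
                    ("assistant", "It sounds like bossa nova.")],
                   "Which instruments stand out?"))
```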
In terms of real-world applications, the advantages of dialogue-based description are clear: instead of being constrained to a one-shot caption or answer, it allows users to provide text inputs to further instruct the model on what kind of information should be included, or how the text output itself should be structured. In short, these tasks make for a much more flexible approach which better reflects real-world use. One drawback is that they are harder to evaluate (see [Evaluation](description-evaluation))!
In terms of real-world applications, the advantages of dialogue-based description are clear: instead of being constrained to a one-shot caption or answer, it allows users to provide text inputs to further instruct the model on what kind of information should be included, or how the text output itself should be structured. In short, these tasks make for a much more flexible approach which better reflects real-world use. One drawback is that they are harder to evaluate (see [Evaluation](description_evaluation))!

## References
