Update documentation
ilaria-manco committed Nov 10, 2024
1 parent c381f2c commit 124ca69
Showing 18 changed files with 7,897 additions and 163 deletions.
20 changes: 4 additions & 16 deletions _sources/description/datasets.ipynb
@@ -5,7 +5,7 @@
"metadata": {},
"source": [
"(caption_datasets)=\n",
"# Music Description Datasets\n",
"# Datasets\n",
"\n",
"## Overview\n",
"Below we review existing music datasets that contain natural language text accompanying music items. These can be music descriptions such as captions, or other types of text such as question-answer pairs and reviews. Note that in many cases the audio isn't directly distributed with the dataset and may be subject to copyright. \n",
@@ -36,16 +36,8 @@
"### MusicCaps (MC)\n",
"{cite}`agostinelli2023musiclm`\n",
"\n",
"🚧\n",
"\n",
"```{warning}\n",
"🚧\n",
"```\n",
"\n",
"### The Song Describer Dataset (SDD)\n",
"{cite}`manco2023song`\n",
"\n",
"🚧"
"{cite}`manco2023song`"
]
},
{
@@ -163,17 +155,13 @@
"### YouTube8M-MusicTextClips (YT8M-MTC)\n",
"{cite}`mckee2023language`\n",
"\n",
"🚧\n",
"\n",
"### Datasets with synthetic text \n",
"The three human-written caption datasets we have just seen often form the basis for other derived datasets that use text templates to extend the original captions (MusicBench), or LLM-enabled augmentation to transform them into question-answer pairs (e.g. MusicQA, MusicInstruct)\n",
"\n",
"🚧\n",
"\n",
"* LP-MusicCaps: captions generated from ground-truth tags via OpenAI's GPT-3.5 Turbo; 2.2M (MSD), 88k (MTT), and 22k (MC) captions\n",
"* MusicBench: captions from the MusicCaps dataset expanded with features extracted from audio (chords, beats, tempo, and key)\n",
-"* MusicQA: 🚧\n",
-"* MusicInstruct: 🚧\n",
+"* MusicQA: \n",
+"* MusicInstruct: \n",
"\n",
"## References\n",
"\n",
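The tag-to-caption approach behind LP-MusicCaps can be sketched as a simple prompt-construction step before the LLM call. The template below is purely illustrative and is not the actual prompt used by the LP-MusicCaps authors:

```python
def build_caption_prompt(tags):
    """Build an LLM prompt asking for a caption from ground-truth tags.

    Illustrative template only; the real LP-MusicCaps prompts differ.
    """
    tag_list = ", ".join(tags)
    return (
        "Write a one-sentence description of a music track "
        f"with the following tags: {tag_list}."
    )

prompt = build_caption_prompt(["jazz", "piano", "uptempo", "instrumental"])
print(prompt)
```

The resulting prompt string would then be sent to an LLM (GPT-3.5 Turbo in the case of LP-MusicCaps) to generate a pseudo-caption for each tagged track.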
3 changes: 1 addition & 2 deletions _sources/description/evaluation.md
@@ -1,5 +1,5 @@
(description_evaluation)=
-# Music Description Evaluation
+# Evaluation

## Overview
Reliably evaluating music description systems is a challenging endeavour. Even when we have "ground-truth" captions, it is not always clear how to score generated text, as music description is open-ended and at least partially subjective. The quality of a description also depends strongly on the context in which it is used. This issue becomes even more pronounced in dialogue-based tasks such as MQA and other forms of instruction-based description.
@@ -27,7 +27,6 @@ We briefly review each of these metrics below:
* **BERT-Score** also computes the similarity between tokens in a generated sentence and tokens in the ground-truth text, but does so using contextual embeddings obtained from a pre-trained BERT model instead of exact matches, resulting in a higher correlation with human judgements.
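As a concrete illustration, the clipped (modified) n-gram precision at the core of BLEU can be sketched in a few lines of Python. This is a simplified single-sentence version without smoothing, the brevity penalty, or corpus-level aggregation:

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision, the core quantity behind BLEU.

    Each candidate n-gram counts at most as many times as it
    appears in the reference ("clipping").
    """
    cand_ngrams = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[reference[i:] for i in range(n)]))
    clipped = {g: min(c, ref_ngrams[g]) for g, c in cand_ngrams.items()}
    total = sum(cand_ngrams.values())
    return sum(clipped.values()) / total if total else 0.0

cand = "a calm piano melody".split()
ref = "a calm solo piano melody".split()
print(modified_ngram_precision(cand, ref, n=1))  # → 1.0
print(modified_ngram_precision(cand, ref, n=2))  # 2 of 3 bigrams match
```

BERT-Score replaces the exact n-gram matches above with cosine similarities between contextual BERT embeddings, which is why it correlates better with human judgements.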

## Other types of automatic evaluation
🚧
* Multiple-choice question answering: MuChoMusic {cite}`weck_muchomusic_2024`
* Other benchmarks: OpenMU {cite}`zhao_openmu_2024`
* LLM-as-a-judge
9 changes: 1 addition & 8 deletions _sources/description/models.md
@@ -1,5 +1,5 @@
(description_models)=
-# Music Description Models
+# Models

## Overview

@@ -136,22 +136,15 @@ We don't discuss these in detail, but their high-level design is similar to the

#### Adapter Modules

Let's look at some examples of adapter modules in the literature.

🚧

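As a rough illustration of the general pattern, and not of any specific published model, an adapter can be as simple as a learned linear projection that maps audio-encoder outputs into the LLM's token-embedding space, so that projected audio frames can be prepended to the text-prompt embeddings. All dimensions below are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 768-d audio encoder, 4096-d LLM embeddings.
audio_dim, llm_dim = 768, 4096
W = rng.normal(scale=0.02, size=(audio_dim, llm_dim))  # learned in practice

def adapt(audio_features):
    """Project audio-encoder frames into the LLM embedding space."""
    return audio_features @ W

audio_features = rng.normal(size=(25, audio_dim))  # 25 audio frames
text_embeddings = rng.normal(size=(10, llm_dim))   # 10 text tokens

# Prepend projected audio "tokens" to the text prompt embeddings.
llm_input = np.concatenate([adapt(audio_features), text_embeddings], axis=0)
print(llm_input.shape)  # (35, 4096)
```

Published adapters are often deeper (e.g. MLPs or cross-attention modules), but they serve the same role: bridging the audio encoder's output space and the LLM's input space.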
#### Training
As in the text-only setting, training adapted LLMs is usually broken into several stages. After pre-training and finetuning of the text-only part, the remaining components undergo a series of multimodal training stages, while the backbone LLM is either kept frozen or further finetuned. These stages are usually a mixture of multi-task pre-training and supervised finetuning, often including instruction tuning, all carried out on pairs of audio and text data.

##### Instruction Tuning
Common to both designs, instruction tuning deserves particular attention in our discussion of music-language AR Transformers. 🚧
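To make the idea concrete, instruction-tuning data for music description typically consists of (audio, instruction, response) triples serialised into a chat-style training prompt. The record structure and template below are a generic illustration, not the format of any specific model:

```python
example = {
    "audio": "track_00123.wav",  # placeholder path, illustrative only
    "instruction": "Describe the mood and instrumentation of this track.",
    "response": "A mellow lo-fi beat with warm electric piano and soft vinyl crackle.",
}

def to_training_prompt(ex, audio_placeholder="<audio>"):
    """Serialise one example into a supervised finetuning prompt.

    The audio placeholder marks where the projected audio embeddings
    are inserted into the LLM input sequence.
    """
    return (
        f"USER: {audio_placeholder}\n{ex['instruction']}\n"
        f"ASSISTANT: {ex['response']}"
    )

print(to_training_prompt(example))
```

During training, the loss is typically computed only on the response tokens, so the model learns to answer instructions rather than to reproduce them.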

### Natively Multimodal Autoregressive Transformers
Other autoregressive Transformer models for music description share a similar core modelling mechanism with adapted LLMs. One key difference is that, while adapted LLMs require modality-specific encoders, usually pre-trained separately, natively multimodal LLMs forgo these in favour of a unified tokenization scheme that treats audio tokens much like text tokens from the start.
This paradigm is sometimes referred to as mixed-modal early-fusion modelling.
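In a unified tokenization scheme, audio codec tokens and text tokens share a single vocabulary, for instance by offsetting audio token ids past the end of the text vocabulary. A toy sketch, with made-up vocabulary sizes:

```python
TEXT_VOCAB_SIZE = 32_000    # illustrative text vocabulary size
AUDIO_CODEBOOK_SIZE = 1024  # illustrative, e.g. one codec codebook

def to_shared_vocab(text_ids, audio_ids):
    """Map text and audio tokens into one shared id space and join them
    into a single flat sequence the Transformer models autoregressively."""
    shifted_audio = [TEXT_VOCAB_SIZE + a for a in audio_ids]
    return text_ids + shifted_audio

# e.g. a short text prompt followed by neural codec tokens for the audio
seq = to_shared_vocab(text_ids=[5, 17, 42], audio_ids=[0, 3, 1023])
print(seq)  # [5, 17, 42, 32000, 32003, 33023]
```

A single embedding table of size `TEXT_VOCAB_SIZE + AUDIO_CODEBOOK_SIZE` then covers both modalities, which is what allows the model to be trained end-to-end without a separate pre-trained audio encoder.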

🚧

It's worth noting that, at this time, this type of model is a promising direction for music description rather than a fully established paradigm. Currently, no music-specialised multimodal AR Transformers exist, but some general-purpose models include music-domain data in their training and evaluation. This is in line with the overall trend of developing large-scale models that tackle all domains, but it remains to be seen what the impact of this modelling paradigm will be on music description in the years to come. Among current examples of this type of model that include music description we have:
* AnyGPT {cite}`doh2023lp`
*
4 changes: 1 addition & 3 deletions _sources/description/tasks.md
@@ -1,5 +1,5 @@
(description_tasks)=
-# Music Description Tasks
+# Tasks
As we've established, music description encompasses multiple different tasks.
Let's now look at each of these in more detail, going from those that give the simplest form of output (a categorical label) to those that produce more complex, natural language-based outputs. Through this, we also review some of the history of deep learning-based AMU systems.

@@ -91,8 +91,6 @@ align: center
---
```

-🚧
-
## References

```{bibliography}
