Commit 853e86c (parent: 55145be)
Showing 425 changed files with 11,818 additions and 53,933 deletions.
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 4f3b7b65b49c13cc9151b09ed2b87d73
+config: b34b5dcc849dfc5c0b598ed8e0ccac59
 tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file removed: _images/858c0c3a67cb46fbe88e58ec11eb1d4cb103c52f5262b85390cd4c9c87137962.png (-65.5 KB)
@@ -0,0 +1,12 @@

| Dataset | Description | Size | Public | Audio Length | Audio | Text | License |
| ------- | ----------- | ---- | ------ | ------------ | ----- | ---- | ------- |
| [MusicCaps](https://www.kaggle.com/datasets/googleai/musiccaps) | Music captions written by musicians | 5,521 | ✅ | 10 sec | ❌ (YouTube IDs from AudioSet) | Human-written | CC BY-SA 4.0 |
| [Song Describer Dataset]() | Crowdsourced music captions | 1,186 | ✅ | 2 min | ✅ (MTG-Jamendo) | Human-written | CC BY-SA 4.0 |
| [YouTube8M-MusicTextClips]() | Crowdsourced music captions | 3,169 | ✅ | 10 sec | ❌ (YouTube IDs from YouTube8M) | Human-written | CC BY-SA 4.0 |
| [LP-MusicCaps]() | | 1k audio-caption pairs | ✅ | | From [...] | Synthetic | |
| [MusicQA]() | | 1k audio-caption pairs | ✅ | | YouTube IDs | Synthetic | |
| [MusicInstruct]() | | 1k audio-caption pairs | ✅ | | YouTube IDs | Synthetic | |
| [MusicBench]() | | 1k audio-caption pairs | ✅ | | YouTube IDs | Synthetic | |
| [MuEdit]() | Album reviews | 65,566 albums and 263,525 reviews | ❌ | | ❌ | | |
| [MARD]() | Album reviews | 65,566 albums and 263,525 reviews | ❌ | | ❌ | | |
@@ -0,0 +1,56 @@
# Music Description Datasets

## Overview
Below we review existing music datasets that contain natural language text accompanying music items. These can be music descriptions such as captions, or other types of text such as question-answer pairs and reviews. Note that in many cases the audio isn't directly distributed with the dataset and may be subject to copyright.

Among the caption datasets, only three feature fully human-written descriptions: MusicCaps, the Song Describer Dataset, and YouTube8M-MusicTextClips.

| Dataset | Description | Size | Audio | Audio Length | Audio License | Text | Text License |
| ------- | ----------- | ---- | ----- | ------------ | ------------- | ---- | ------------ |
| [MusicCaps](https://www.kaggle.com/datasets/googleai/musiccaps) | Captions (by musicians) | 5.5k | ❌ <br> (YT IDs from AudioSet) | 10 sec | | Human-written | CC BY-SA 4.0 |
| [Song Describer Dataset](https://zenodo.org/records/10072001) | Captions (crowdsourced) | 1.2k | ✅ <br> (MTG-Jamendo) | 2 min | | Human-written | CC BY-SA 4.0 |
| [YouTube8M-MusicTextClips](https://zenodo.org/records/8040754) | Captions (crowdsourced) | 3.2k | ❌ <br> (YT IDs from YouTube8M) | 10 sec | | Human-written | CC BY-SA 4.0 |
| [LP-MusicCaps](https://github.com/seungheondoh/lp-music-caps) | Captions (generated from tags) | 2.2M / 88k / 22k | ❌ <br> (MusicCaps, MagnaTagATune, and Million Song Dataset ECALS) | 30s / 10s | | Synthetic | CC BY-NC 4.0 |
| [MusicQA](https://huggingface.co/datasets/mu-llama/MusicQA) | Question-answer pairs (generated from tags/captions) | | ❌ <br> (YT IDs; MusicCaps, MagnaTagATune) | | | Synthetic | |
| [MusicInstruct](https://huggingface.co/datasets/m-a-p/Music-Instruct) | Question-answer pairs (generated from captions) | 28k / 33k | ❌ <br> (YT IDs) | | | Synthetic | |
| [MusicBench](https://paperswithcode.com/dataset/musicbench) | Captions (expanded via text templates) | 53k | ✅ | 10s | ❌ <br> (from MusicCaps) | Human-written + text templates | CC |
| [MuEdit]() | Album reviews | | ❌ | | | | |
| [MARD]() | Album reviews | 264k | ❌ | | | | |

## Music Caption Datasets
Focusing on the caption datasets, let's now review these in more detail.
### MusicCaps (MC)
{cite}`agostinelli2023musiclm`

[TODO]

```{warning}
[TODO]
```
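For a quick look at the data, the snippet below reads the Kaggle release. This is a sketch under assumptions: the file name `musiccaps-public.csv` and its column names are taken from the public Kaggle distribution and may change; the audio itself is not distributed and must be fetched separately from the referenced YouTube segments.

```python
import pandas as pd

# Assumes musiccaps-public.csv has been downloaded from the Kaggle page above.
df = pd.read_csv("musiccaps-public.csv")
print(len(df))  # ~5.5k captioned clips

row = df.iloc[0]
# Each row references a 10-second YouTube segment (audio not included).
print(row["ytid"], row["start_s"], row["end_s"])
print(row["caption"])
```
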
### The Song Describer Dataset (SDD)
{cite}`manco2023song`

[TODO]

### YouTube8M-MusicTextClips (YT8M-MTC)
{cite}`mckee2023language`

[TODO]

### Datasets with synthetic text
The three human-written caption datasets we have just seen often constitute the basis for other derived datasets, which use text templates to extend the original captions (MusicBench), or LLM-enabled augmentation to transform them into question-answer pairs (e.g. MusicQA, MusicInstruct).

[TODO]

* LP-MusicCaps: captions generated from ground-truth tags via OpenAI's GPT-3.5 Turbo; 2.2M (MSD), 88k (MTT) and 22k (MC) captions
* MusicBench: captions from the MusicCaps dataset expanded with features extracted from audio (chords, beats, tempo, and key)
* MusicQA: [TODO]
* MusicInstruct: [TODO]

## References

```{bibliography}
:filter: docname in docnames
```
@@ -0,0 +1,15 @@
# Music Description Evaluation
Evaluation of music description systems.

## Match-based metrics
Borrowing from the computer vision and NLP literature, music captioning is typically evaluated through a set of automatic metrics such as BLEU, METEOR, ROUGE and, more recently, BERTScore. These metrics score a generated caption by measuring its n-gram or embedding-based similarity to one or more reference captions.
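For instance, the snippet below scores a generated caption against a human reference using the Hugging Face `evaluate` package. This is one of several possible implementations, assumed here purely for illustration; the example sentences are made up.

```python
# pip install evaluate nltk rouge_score bert_score
import evaluate

candidates = ["a mellow acoustic guitar melody with soft percussion"]
references = ["a gentle acoustic guitar tune backed by quiet drums"]

# n-gram overlap metrics
bleu = evaluate.load("bleu").compute(predictions=candidates, references=[references])
rouge = evaluate.load("rouge").compute(predictions=candidates, references=references)
meteor = evaluate.load("meteor").compute(predictions=candidates, references=references)

# embedding-based similarity (downloads a BERT checkpoint on first use)
bertscore = evaluate.load("bertscore").compute(
    predictions=candidates, references=references, lang="en"
)

print(f"BLEU:         {bleu['bleu']:.3f}")
print(f"ROUGE-L:      {rouge['rougeL']:.3f}")
print(f"METEOR:       {meteor['meteor']:.3f}")
print(f"BERTScore F1: {bertscore['f1'][0]:.3f}")
```

Note that n-gram metrics penalise valid paraphrases that share few exact words with the reference, which is one motivation for embedding-based scores like BERTScore.
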
## Other types of automatic evaluation
* Multiple-choice question answering
* LLM-as-a-judge (see the sketch after this list)
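As a sketch of the LLM-as-a-judge idea: a text-only LLM is shown a reference description and a candidate caption and asked to grade the candidate. The prompt, scoring scale, and `llm` callable below are illustrative assumptions, not taken from any specific paper.

```python
JUDGE_PROMPT = """You are evaluating a music caption.
Reference description: {reference}
Candidate caption: {candidate}
Rate the candidate from 1 (wrong) to 5 (fully accurate and complete).
Reply with a single integer."""

def judge(llm, reference: str, candidate: str) -> int:
    """Ask an LLM (any text-in, text-out callable) to grade a caption."""
    reply = llm(JUDGE_PROMPT.format(reference=reference, candidate=candidate))
    return int(reply.strip())
```
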
## References

```{bibliography}
:filter: docname in docnames
```
@@ -1 +1,29 @@
-# Music Annotation
+# Overview

## What is music description?
The goal of *automatic music description* (AMU) is to analyse music audio and "translate" it into human-readable form. As such, AMU is an umbrella term that covers several tasks.

We can distinguish between different types of music description along two axes:
- **Abstraction**: the abstraction level of the features captured in the description ("*what* we describe")
- **Complexity**: the complexity of the description ("*how* we describe")

```{figure} ./img/description.png
---
name: overview
---
```

In this tutorial, we're particularly interested in natural language descriptions, which occupy [..] on the *complexity* axis, and usually span a wide range of abstraction levels within it.

## Why do we need automatic music description?
Being able to automatically create descriptions of music audio content is useful for a variety of practical purposes. For example, through AMU we can:

- Annotate large collections of music for easier search, navigation and organisation
- Generate human-readable summaries of musical content for the Deaf or Hard-of-hearing, or when audio playback is not possible
- Automatically caption music sections in videos and films
- Produce educational resources
- Enable personalised music recommendation systems using natural language queries

In the last few years, AMU systems have also become widely used for **synthetic data generation**. In this case, instead of finding a direct application, they are exploited to generate additional text data from unlabeled or partially labeled audio. The synthetic descriptions are then used to support training of machine learning models on other tasks that require (audio, text) pairs, such as [text-to-music retrieval]() and [text-to-music generation](). This is how some of the latest audio-text music datasets are produced (see []()).
@@ -0,0 +1,49 @@
# Music Description Models
Deep learning models for music description via natural language typically fit into one of two designs:

- Encoder-decoder
- Multimodal (adapted) LLM

## Encoder-Decoder Models
The encoder-decoder modelling framework emerged in the context of sequence-to-sequence tasks, and was first applied to machine translation. At a high level, models of this type are composed of two main modules, an encoder and a decoder. The encoder is responsible for processing the input sequence (i.e. the audio input) into an intermediate representation (the context $c$), and the decoder then "unrolls" this representation into a target sequence (e.g. text describing the audio input). The first example of an encoder-decoder model for music description appeared in work by Choi *et al.* {cite}`choi2016towards`. These early iterations of encoder-decoder music captioners employed CNN-based audio encoders alongside RNN-based language decoders {cite}`choi2016towards` {cite}`manco2021muscaps`. More recent iterations of this framework typically make use of a Transformer-based language decoder, alongside CNNs [TODO] or Transformer audio encoders [TODO], and sometimes a hybrid of both [TODO]. Formally, the encoder maps the input audio $X$ to a context representation $c$, and the decoder factorises the probability of the output sequence $Y = (y_1, \ldots, y_n)$ autoregressively:

$$
c = f_{\text{encoder}}(X)
$$

$$
P(Y | c) = \prod_{t=1}^{n} P(y_t | y_1, y_2, \ldots, y_{t-1}, c)
$$

Beyond architectural choices, a key aspect that differentiates encoder-decoder models is the mechanism employed to fuse audio and text representations. The choice of fusion mechanism is typically tied to the encoder architecture used. For example, in models that employ Transformer-based architectures, fusion often happens via cross-attention [TODO]. By contrast, earlier models with RNN-based text decoders employed a wider range of fusion mechanisms, such as feature concatenation or cross-modal attention, at various stages of the processing pipeline, from early to late.
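To make this concrete, here is a minimal sketch of an encoder-decoder captioner with cross-attention fusion. All module names, hyperparameters and shapes are illustrative assumptions, not taken from any particular published model; PyTorch is assumed.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """CNN front-end mapping a mel-spectrogram to a sequence of context vectors c."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )

    def forward(self, mel):                    # mel: (batch, n_mels, time)
        return self.conv(mel).transpose(1, 2)  # c: (batch, time', d_model)

class CaptionDecoder(nn.Module):
    """Transformer decoder that attends to the audio context via cross-attention."""
    def __init__(self, vocab_size=10_000, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, context):  # tokens: (batch, seq); context: (batch, time', d_model)
        x = self.embed(tokens)
        # Causal mask so position t only sees y_1 .. y_{t-1}
        t = tokens.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.decoder(x, context, tgt_mask=mask)
        return self.lm_head(h)           # logits over next tokens

encoder, decoder = AudioEncoder(), CaptionDecoder()
mel = torch.randn(2, 80, 1000)               # a batch of mel-spectrograms
tokens = torch.randint(0, 10_000, (2, 12))   # partial captions
logits = decoder(tokens, encoder(mel))       # (2, 12, vocab): P(y_t | y_<t, c)
```

Cross-attention inside each decoder layer is what fuses the two modalities here: the text stream queries the audio context $c$ at every generation step.
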
## Multimodal Autoregressive Transformers

### Adapting LLMs to audio-text inputs
The success of Large Language Models (LLMs) has largely influenced the development of music description in recent years. As a consequence, today's state-of-the-art models rely on LLMs in one form or another. One modelling paradigm that has become particularly popular is that of adapted (multimodal) LLMs. At the core of this approach is a pre-trained text-only LLM, which is adapted to take in inputs of different modalities, such as audio. This is achieved via an *adapter* module, a lightweight neural network trained to map embeddings produced by an audio feature extractor (usually pre-trained and then frozen) to the input space of the LLM. As a result of this adaptation process, the LLM can then receive audio embeddings alongside text embeddings.
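To illustrate the idea, below is a minimal sketch of a generic MLP adapter with prefix conditioning. The dimensions, module names and the prepending strategy are illustrative assumptions rather than the design of any specific published system.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Maps frozen audio-encoder features into the LLM's input embedding space."""
    def __init__(self, d_audio=512, d_llm=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_audio, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, audio_feats):    # (batch, n_frames, d_audio)
        return self.proj(audio_feats)  # (batch, n_frames, d_llm)

# Stand-ins for frozen audio-encoder features and embedded prompt tokens
audio_feats = torch.randn(1, 32, 512)
text_embeds = torch.randn(1, 10, 4096)

adapter = AudioAdapter()
audio_tokens = adapter(audio_feats)
# Prefix conditioning: audio "tokens" are prepended to the text sequence,
# and the (frozen or fine-tuned) LLM consumes the combined sequence.
llm_inputs = torch.cat([audio_tokens, text_embeds], dim=1)  # (1, 42, 4096)
```

Only the adapter (and optionally the LLM) is trained, which keeps the adaptation cheap relative to training a multimodal model from scratch.
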
Let's look at some examples of adapter modules in the literature.

[TODO]

There are several techniques to pass audio-text embeddings to the LLM. [TODO]
## Other Modelling Paradigms
[TODO]

## References

```{bibliography}
:filter: docname in docnames
```
@@ -0,0 +1,55 @@
# Music Description Tasks
Let's now look at the range of music description tasks, going from those that give the "simplest" form of output (a categorical label) to those that produce more complex dialogue-based outputs. Along the way, we also review some of the history of deep learning-based AMU systems.

## From labels to captions

Traditionally, music description in MIR has consisted of classification-based systems that learn to predict one or more labels from an audio input, with each label corresponding to a specific, pre-assigned descriptor. This is the case with classification tasks on categories such as genre {cite}`tzanetakis2002musical`, instrument {cite}`herrera2003automatic`, mood {cite}`kim2010music`, key [TODO], tempo [TODO], and more.

Classification produces categorical labels chosen from a predefined set, which forms a fixed, and often small, vocabulary: $f(x) = \mathbf{p} \in [0, 1]^L$. This type of description is therefore limited in its expressivity: it cannot adapt to new concepts, or model the relationships between labels.
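As a minimal illustration of this formulation (the tag vocabulary, feature dimensions and threshold below are made up), a tagging model ends in a sigmoid layer that outputs one independent probability per label in the fixed vocabulary:

```python
import torch
import torch.nn as nn

TAGS = ["rock", "jazz", "piano", "guitar", "happy", "sad"]  # fixed vocabulary, L = 6

tagger = nn.Sequential(
    nn.Linear(128, 64),   # stand-in for a real audio feature extractor + head
    nn.ReLU(),
    nn.Linear(64, len(TAGS)),
    nn.Sigmoid(),         # p ∈ [0, 1]^L, one probability per tag
)

audio_embedding = torch.randn(1, 128)
p = tagger(audio_embedding)
predicted = [tag for tag, prob in zip(TAGS, p[0].tolist()) if prob > 0.5]
```

Whatever the audio contains, the output can only ever be a subset of `TAGS`, which is precisely the expressivity limit described above.
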
```{figure} ./img/tags.png
---
name: tagging
---
```

For this reason, in recent years many have started incorporating natural language in music description systems, developing models that map audio inputs to full sentences. Enjoying the benefits of natural language, these systems can produce descriptions that are more nuanced, expressive and human-like. Given our focus on natural language, we mostly look at this type of music description in the rest of this tutorial.

```{figure} ./img/caption.png
---
name: captioning
---
```
### Music Captioning
In music audio captioning, the goal is to generate natural language outputs describing an audio input. We can think of it as a type of conditional language modelling, where we seek to predict the next token in a sequence based not only on prior text tokens, but also on the audio:

$$
P(y|a) = \prod_{t=1}^{n} P(y_t | y_1, y_2, \ldots, y_{t-1}, a)
$$
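At inference time, this factorisation is typically unrolled one token at a time. Below is a sketch of greedy decoding; the `model(audio, tokens)` interface and the special token ids are hypothetical stand-ins for whatever captioner is being used.

```python
import torch

def greedy_caption(model, audio, bos_id=1, eos_id=2, max_len=64):
    """Greedy decoding: pick the argmax of P(y_t | y_<t, a) at each step."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = model(audio, torch.tensor([tokens]))  # (1, t, vocab)
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # caption token ids, excluding BOS
```

In practice, sampling or beam search is often used instead of the argmax to obtain more diverse or higher-quality captions.
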
Music captioning can be performed at the sub-track, track or multi-track level, depending on whether the audio input is a segment of a longer track (typically fixed-size), a whole variable-length track, or a sequence of multiple tracks (i.e. a playlist). In the latter case, we usually refer to the task as *playlist captioning*; when saying *music captioning*, we usually mean captioning of either a clip or a full track {cite}`manco2021muscaps`.
## From captions to dialogues
In even more recent developments, largely influenced by dialogue-based LLM applications, some music description systems can also answer questions or engage in multi-turn conversations about audio inputs. In other words, they map an (audio, text) input pair to a text output.
### Music Question Answering
[TODO]
### Music Dialogue Generation

## References

```{bibliography}
:filter: docname in docnames
```