
Commit

Update documentation
Unknown committed Nov 10, 2024
1 parent 41b11c5 commit 0a8f95a
Showing 79 changed files with 16,105 additions and 2,216 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: db8c3136e21a7390d113749b043e83a3
config: 79bb7ee00347aa10773981b437044cde
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added _images/choi_zeroshot.png
Binary file added _images/clap_mulan.png
Binary file added _images/cls_methods.png
Binary file added _images/cls_to_je.png
Binary file added _images/doh_enrich.png
Binary file removed _images/generation.png
Binary file not shown.
Binary file removed _images/history.png
Binary file not shown.
Binary file modified _images/main.png
Binary file removed _images/oov.png
Binary file not shown.
Binary file removed _images/representation.png
Binary file not shown.
Binary file removed _images/retrieval_example.png
Binary file not shown.
21 changes: 20 additions & 1 deletion _sources/conclusion/intro.md
Original file line number Diff line number Diff line change
@@ -1 +1,20 @@
# Conclusion

Congratulations! You finished the book, ran every piece of code we typed, and read every line we wrote!

In the first chapter, The Basics, we defined music classification and introduced its applications. We then looked into input representations, with a special focus on biological plausibility, and into music classification datasets, paying particular attention to how to use some popular datasets correctly. In the evaluation section, we explained important metrics such as precision and recall and provided a code demo for computing them. After finishing this chapter, we hope you’re ready to start working on your own music classification model.

In the second chapter, Supervised Learning, we reviewed popular architectures: their definitions, strengths, and weaknesses. We also demonstrated data augmentation methods for music audio, with code, spectrograms, and audio signals you can play. At the end of the chapter, we walked through a full example of data preparation, model training, and evaluation in PyTorch. After this chapter, you can implement the majority of the music classification models introduced during the deep learning era.

In the third chapter, Semi-Supervised Learning, we covered transfer learning and semi-supervised learning, approaches that have recently become popular due to the cost of annotation. Both are strategies to consider when only a small number of labeled items is available. These approaches can be useful in many real-world situations where you have, for example, fewer than a thousand labeled items.

In the fourth chapter, Self-Supervised Learning, we introduced an even more radical approach. The goal of self-supervised learning is to learn useful representations without any labels. To achieve this, researchers assume some structural or internal patterns that exist purely within the input and design loss functions that predict those patterns. We covered a wide range of self-supervised learning methods introduced in music, speech, and computer vision. The lessons of this chapter free you from worrying about obtaining annotations.

In the fifth chapter, Towards Real-world Applications, we introduced what people care about in industry. After finishing this chapter, you will understand the procedures and tasks that researchers and engineers in industry spend their time on.

We’re delighted that you have studied music classification with us. Did you achieve your goal while reading it? Have your questions been answered? We hope we also achieved our goals: lowering the barrier to music classification for newcomers, providing methods to cope with data issues, and narrowing the gap between academia and industry. Please feel free to reach out to us if you have any questions or feedback.

Best wishes,

Minz, Janne, and Keunwoo.

2 changes: 1 addition & 1 deletion _sources/generation/code.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
"id": "AU2aKMwT21Oq"
},
"source": [
"# Code Tutoiral"
"# Code Tutorial"
]
},
{
Expand Down
9 changes: 8 additions & 1 deletion _sources/introduction/advantange.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,14 @@
"source": [
"## 3. Natural Langauge is Human Friendly interface.\n",
"\n",
"Language serves as an effective interface for AI models, (i.e., ChatGPT and Stable Diffusion). Because it leverages natural, intuitive communication methods. Language allows users to express complex queries, requests, or ideas in a flexible and contextually rich way without needing specialized knowledge. In terms of responses, language can also enable the system to generate human-like intentions or answers, which can positively impact user satisfaction and usability."
"Language serves as an effective interface for AI models, (i.e., ChatGPT and Stable Diffusion). Because it leverages natural, intuitive communication methods. Language allows users to express complex queries, requests, or ideas in a flexible and contextually rich way without needing specialized knowledge. In terms of responses, language can also enable the system to generate human-like intentions or answers, which can positively impact user satisfaction and usability.\n",
"\n",
"\n",
"```{figure} ../img/prompt_product.png\n",
"---\n",
"name: prompt_product\n",
"---\n",
"```"
]
}
],
Expand Down
21 changes: 10 additions & 11 deletions _sources/introduction/overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

This tutorial will present the changes in music understanding, retrieval, and generation technologies following the development of language models.

```{figure} ../img/overview.png
```{figure} ./img/overview.png
---
name: scope
---
Expand All @@ -11,11 +11,12 @@ Illustration of the development of music and language models.

## Language Models

Chapter 2 presents an introduction to language models (LMs), essential for enabling machines to understand natural language and their wide-ranging applications. It traces the development from simple one-hot encoding and word embeddings to more advanced language models, including masked langauge model {cite}`devlin2018bert`, auto-regressive langauge model {cite}`radford2019language`, and encoder-decoder langauge model {cite}`raffel2020exploring`, progressing to cutting-edge instruction-following {cite}`wei2021finetuned` {cite}`ouyang2022training` {cite}`chung2024scaling` and large language models {cite}`achiam2023gpt`. Additionally, the chapter demonstrates how language models are utilized in various domains, such as vision and speech, highlighted by examples such as joint embedding techniques like CLIP {cite}`radford2021learning`, encoder-decoder frameworks like Whisper {cite}`radford2023robust`, and text-driven image generation models like DALL-E {cite}`ramesh2021zero`.
Chapter 2 presents an introduction to language models (LMs), which are essential for enabling machines to understand natural language, and to their wide-ranging applications. It traces the development from simple one-hot encoding and word embeddings to more advanced language models, including the masked language model {cite}`devlin2018bert`, the auto-regressive language model {cite}`radford2019language`, and the encoder-decoder language model {cite}`raffel2020exploring`, progressing to cutting-edge instruction-following {cite}`wei2021finetuned` {cite}`ouyang2022training` {cite}`chung2024scaling` and large language models {cite}`achiam2023gpt`. Furthermore, we review the components and conditioning methods of language models, as well as current challenges and potential solutions when using language models as a framework.

## Music Annotation (Music -> Natural Language)

```{figure} ../img/annotation.png
## Music Description

```{figure} ./img/annotation.png
---
name: annotation
---
Expand All @@ -24,22 +25,20 @@ name: annotation
Chapter 3 offers an in-depth look at music annotation as a tool for enhancing music understanding. It begins with defining the task and problem formulation, transitioning from basic classification {cite}`turnbull2008semantic` {cite}`nam2018deep` to more complex language decoding tasks. The chapter further explores encoder-decoder models {cite}`manco2021muscaps` {cite}`doh2023lp` and the role of multimodal large language models (LLMs) in music understanding {cite}`gardner2023llark`. It traces the evolution from `task-specific classification models` to `more generalized multitask models` trained with diverse natural language supervision.


## Music Retrieval (Natural Language -> Database Music)

## Music Retrieval

```{figure} ../img/retrieval.png
```{figure} ./img/retrieval.png
---
name: retrieval
---
```

Chapter 4 focuses on text-to-music retrieval, a key component in music search, detailing the task's definition and various search methodologies. It spans from basic boolean and vector searches to advanced techniques that bridge words to music through joint embedding methods {cite}`choi2019zero`, addressing challenges like out-of-vocabulary terms. The chapter progresses to sentence-to-music retrieval {cite}`huang2022mulan` {cite}`manco2022contrastive` {cite}`doh2023toward`, exploring how to integrate complex musical semantics, and conversational music retrieval for multi-turn dialog-based music retrieval {cite}`chaganty2023beyond`. It introduces evaluation metrics and includes practical coding exercises for developing a basic joint embedding model for music search. This chapter focuses on how models address `users' musical queries` in various ways.

## Music Generation (Natural Language -> Sampled Music)

## Music Generation

```{figure} ../img/generation.png
```{figure} ./img/generation.png
---
name: generation
---
Expand Down
8,856 changes: 8,837 additions & 19 deletions _sources/retrieval/code.ipynb

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions _sources/retrieval/evaluate.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# Evaluation

## Overview




30 changes: 25 additions & 5 deletions _sources/retrieval/intro.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
# Introduction to Text-to-Music Retrieval
# Introduction


```{figure} ../img/retrieval_example.png
```{figure} ./img/cal_retrieval.png
---
name: cal_retrieval
---
Expand All @@ -15,10 +14,31 @@ Music retrieval is the task of finding a set of music pieces that match a given
we describe such a system and show that it can both annotate novel audio content with semantically meaningful words and retrieve relevant audio tracks from a database of unannotated tracks given a text-based query. We view the related tasks of semantic annotation and retrieval of audio as one supervised multiclass, multilabel learning problem. {cite}`turnbull2008semantic`
```

However, the vocabulary of the datasets we use unfortunately only contains between 50 and 200 words, and when a user wants to search for music using words outside of this vocabulary, an Out of Vocabulary problem occurs. As a result, we face the need to search for desired music even in `open vocabulary problem`. If we try to solve this using supervised classification, we would need an enormous number of classification models for each task, and it would be too costly to convert all user queries into data. In this chapter, we introduce joint embedding-based retrieval models used to reflect various user text queries, and also introduce conversational music search models that can not only find music but also generate responses.
```{figure} ./img/cls_methods.png
---
name: cls_methods
---
```

Early retrieval methods were based on classification models. Once music has been annotated with relevant attributes through an initial tagging stage, it can be retrieved at search time either through filtering-based boolean search or by ranking tracks with the output logits of the classifier.
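
Below is a minimal sketch of these two strategies, assuming a hypothetical tag vocabulary and a matrix of tagging-model output probabilities rather than an actual model from this tutorial:

```python
import numpy as np

# Hypothetical setup: a small tag vocabulary and a (num_tracks, num_tags)
# matrix of classifier output probabilities standing in for real model outputs.
TAGS = ["rock", "jazz", "piano", "guitar", "happy", "sad"]
tag_probs = np.random.rand(1000, len(TAGS))  # one row of tag probabilities per track

def boolean_search(query_tags, threshold=0.5):
    """Filtering-based boolean search: keep tracks whose predicted probability
    exceeds a threshold for every queried tag."""
    idx = [TAGS.index(t) for t in query_tags]
    mask = (tag_probs[:, idx] > threshold).all(axis=1)
    return np.where(mask)[0]

def logit_search(query_tags, top_k=10):
    """Ranking-based search: score each track by summing the predicted
    probabilities of the queried tags and return the top-k track indices."""
    idx = [TAGS.index(t) for t in query_tags]
    scores = tag_probs[:, idx].sum(axis=1)
    return np.argsort(-scores)[:top_k]

print(boolean_search(["jazz", "piano"]))
print(logit_search(["jazz", "piano"]))
```

Both strategies only work for queries made of tags inside the model's vocabulary, which is exactly the limitation discussed below.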


```{figure} ./img/cls_problems.png
---
name: cls_problems
---
```

However, the classification-based retrieval framework has two problems. First, since we train task-specific classification models, we need as many classification models as there are tasks. Second, the vocabulary of the datasets we use unfortunately contains only between 50 and 200 words, so when a user searches for music using words outside of this vocabulary, an out-of-vocabulary (OOV) problem occurs. As a result, we need to retrieve the desired music even in an `open vocabulary` setting. If we tried to solve this with supervised classification, we would need an enormous number of classification models, one per task, and it would be too costly to convert every user query into labeled data.

```{figure} ../img/oov.png
```{figure}
---
name: oov
---
```

## References

```{bibliography}
:filter: docname in docnames
```
62 changes: 62 additions & 0 deletions _sources/retrieval/joint_embedding.md
Original file line number Diff line number Diff line change
@@ -1 +1,63 @@
# Audio-Text Joint Embedding

## Classification to Joint Embedding

Following the classification framework, the audio-text joint embedding methodology emerged as a way to handle more flexible user queries. Audio-text joint embedding, as a multimodal deep metric learning approach, enables music search beyond fixed vocabularies by leveraging language embeddings from pretrained language models. In this approach, we project audio content and its associated text into a shared space where their similarity can be measured with a dot product.

## Model Architecture

```{figure} ./img/cls_to_je.png
---
name: classification to joint embedding
---
```

At a high level, a joint embedding model is trained with paired text and music audio samples, learning to map related pairs close together in the embedding space while pushing unrelated samples further apart.

Let $x_{a}$ represent a musical audio sample and $x_{t}$ denote its paired text description. The functions $f(\cdot)$ and $g(\cdot)$ represent the audio and text encoders respectively. The output feature embeddings from each encoder are mapped to a shared co-embedding space through projection layers. During training, the model typically employs either triplet loss based on hinge margins or contrastive loss based on cross entropy to learn these mappings.
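
A minimal PyTorch sketch of this setup is shown below; the encoder backbones, feature dimensions, and projection sizes are placeholders rather than a specific model from the literature:

```python
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingModel(nn.Module):
    """Audio-text joint embedding: an audio backbone f(.) and a text backbone g(.),
    each followed by a linear projection into a shared co-embedding space."""

    def __init__(self, audio_encoder, text_encoder, audio_dim, text_dim, joint_dim=128):
        super().__init__()
        self.audio_encoder = audio_encoder  # f(.), e.g. a CNN over mel-spectrograms
        self.text_encoder = text_encoder    # g(.), e.g. a pretrained language model
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, x_audio, x_text):
        z_audio = self.audio_proj(self.audio_encoder(x_audio))
        z_text = self.text_proj(self.text_encoder(x_text))
        # L2-normalize so that a dot product acts as a cosine similarity
        return F.normalize(z_audio, dim=-1), F.normalize(z_text, dim=-1)
```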

## Loss Functions

The most common metric learning loss functions used to train joint embedding models are triplet loss and contrastive loss.

```{figure} ./img/loss_functions.png
---
name: loss functions
---
```

The goal of triplet-loss models is to learn an embedding space where relevant input pairs are mapped closer than irrelevant pairs in the latent space. The objective function is formulated as follows:

$$
\mathcal{L}_{triplet}= \text{max}(0, - f(x_{a}) \cdot g(x_{t}^{+}) + f(x_{a}) \cdot g(x_{t}^{-}) + \delta )
$$
where $\delta$ is the margin, $f(x_{a})$ is the audio embedding, $g(x_{t}^{+})$ is the paired text embedding for the music audio, and $g(x_{t}^{-})$ is the irrelevant text embedding.
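
A short PyTorch sketch of this hinge-margin triplet loss, assuming L2-normalized embeddings and an arbitrary margin value, could look like the following:

```python
import torch.nn.functional as F

def triplet_loss(z_audio, z_text_pos, z_text_neg, margin=0.2):
    """max(0, -f(x_a)·g(x_t+) + f(x_a)·g(x_t-) + margin), averaged over the batch."""
    pos_sim = (z_audio * z_text_pos).sum(dim=-1)  # f(x_a) · g(x_t^+)
    neg_sim = (z_audio * z_text_neg).sum(dim=-1)  # f(x_a) · g(x_t^-)
    return F.relu(neg_sim - pos_sim + margin).mean()
```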

The core idea of contrastive-loss models is to reduce the distance between positive sample pairs while increasing the distance between negative sample pairs. Unlike triplet-loss models, contrastive-loss models can utilize the large number of negative samples available in a mini-batch of size $N$. During training, the audio and text encoders are jointly trained to maximize the similarity between the $N$ positive pairs of (music, text) associations while minimizing the similarity for the $N \times (N-1)$ negative pairs. This is known as the multi-modal version of the InfoNCE loss {cite}`oord2018representation` {cite}`radford2021learning` and is formulated as follows:

$$
\mathcal{L}_\text{Contrastive} = - \frac{1}{N} \sum_{i=1}^N \log \frac{\exp(f(x_{a_i}) \cdot g(x_{t_i}^{+}) / \tau)}{\sum_{j=1}^N \exp(f(x_{a_i}) \cdot g(x_{t_j}) / \tau)}
$$
where $\tau$ is a learnable temperature parameter.
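
A minimal PyTorch sketch of this multi-modal InfoNCE loss is shown below; for simplicity, $\tau$ is passed as a fixed value here, whereas in practice it is usually implemented as a learnable parameter:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_audio, z_text, temperature=0.07):
    """Multi-modal InfoNCE over a mini-batch of N paired (audio, text) embeddings."""
    logits = z_audio @ z_text.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(z_audio.size(0), device=z_audio.device)
    # Diagonal entries are the N positive pairs; off-diagonals are the N*(N-1) negatives.
    return F.cross_entropy(logits, targets)
```

Many implementations symmetrize this objective by averaging the audio-to-text and text-to-audio directions.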

## What is the Benefit of Joint Embedding?

```{figure} ./img/joint_embedding_benefit.png
---
name: joint embedding benefit
---
```
The key advantage of joint embedding is that we can leverage the embedding space of pretrained language models as supervision, rather than being limited to a fixed vocabulary. Since pretrained language models are trained on vast text corpora from the internet, they effectively encode language relationships between words and phrases. In music retrieval scenarios, this allows us to handle zero-shot user queries efficiently by utilizing these rich language representations.

Additionally, by using language model encoders, we can address the out-of-vocabulary problem through subword tokenization techniques like byte-pair encoding (BPE) or sentence-piece encoding. These methods break down unknown words into smaller subword units that exist in the model's vocabulary, enabling the system to handle novel queries.
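
As a small illustration, using a generic pretrained tokenizer rather than a model from this tutorial, a subword tokenizer decomposes an unseen word into known pieces instead of discarding it:

```python
from transformers import AutoTokenizer

# WordPiece/BPE-style subword tokenization of a query containing
# words that are unlikely to appear in a fixed tag vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("lofi chillhop beats for studying"))
# Expected output is something like:
# ['lo', '##fi', 'chill', '##hop', 'beats', 'for', 'studying']
```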

This combination of pretrained language model semantics and subword tokenization provides two key benefits:
1. Flexible handling of open vocabulary queries through language model representations
2. Robust processing of out-of-vocabulary words through subword tokenization


## References

```{bibliography}
:filter: docname in docnames
```
54 changes: 54 additions & 0 deletions _sources/retrieval/models.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# Models

In this chapter, we review recent advances in audio-text joint embedding models and discuss useful design choices and tips for training them.


## Audio-Tag Joint Embedding

```{figure} ./img/choi_zeroshot.png
---
name: Audio-Tag Joint Embedding
---
```

One of the earliest audio-text joint embedding works introduced to the ISMIR community was {cite}`choi2019zero`, which emphasized the effectiveness of pretrained word embeddings (GloVe) in zero-shot music annotation and retrieval scenarios. Subsequently, {cite}`won2021multimodal` extended this idea beyond audio alone by including collaborative filtering embeddings, covering both the acoustic and cultural aspects of music. {cite}`won2021multimodal` and {cite}`doh2024musical` addressed the fact that general-purpose word embeddings are not specific to the music domain by training audio-text joint embeddings with music-domain-specific word embeddings.

However, these models faced limitations in handling multiple attribute queries or complex sentence-level queries due to their reliance on word embeddings. This is because word embeddings are static - they do not encode different meanings based on surrounding context tokens. As a result, research using these models was constrained to tag-level retrieval scenarios.

## Audio-Multi-Tag Joint Embedding

To better handle multiple attribute semantic queries, researchers have shifted their focus from word embeddings to bi-directional transformer encoders {cite}`devlin2018bert` {cite}`liu2019roberta`. They aimed to leverage **Contextualized Word Representations** that can encode different meanings of multiple attributes or sentences based on co-occurring words. {cite}`chen2022learning` and {cite}`doh2023toward` evaluated the language model's ability to understand multiple attribute queries by utilizing existing multilabel tagging datasets.
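
As a hedged sketch of how such a contextualized query embedding can be obtained with a generic BERT-style encoder (mean pooling is an illustrative choice here, not necessarily what the cited works use):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

query = "relaxing acoustic guitar for a rainy evening"
inputs = tokenizer(query, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
query_emb = hidden.mean(dim=1)  # a single contextualized embedding for the whole query
```

Because every token's representation depends on its context, multi-attribute queries such as "sad piano for studying" and "upbeat piano for a party" receive different embeddings even though they share words.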

## Audio-Sentence Joint Embedding

```{figure} ./img/clap_mulan.png
---
name: Audio-Sentence Joint Embedding
---
```

To handle flexible natural language queries, researchers turned to noisy audio-text datasets {cite}`huang2022mulan` and to human-generated natural language annotations beyond traditional annotation datasets {cite}`manco2022contrastive`. Thanks to sufficient dataset scaling and to contrastive loss, which benefits from large batch sizes, they built joint embedding models with stronger audio-text associations than previous studies. {cite}`manco2022contrastive`, {cite}`huang2022mulan`, and {cite}`wu2023large` demonstrated that contrastive learning with large-scale audio-text pairs can effectively learn semantic relationships between music and natural language descriptions.


## Beyond Semantic Attributes: Toward Handling Similarity Queries


```{figure} ./img/doh_enrich.png
---
name: Similarity Queries
---
```

Recent work has explored expanding joint embedding models beyond semantic attribute queries. While existing datasets focus on genre, mood, instruments, style, and theme attributes, {cite}`doh2024musical` proposed training joint embedding models that can handle similarity-based queries by leveraging diverse metadata and music knowledge graphs. This enables the model to understand relationships between songs based on metadata similarity rather than just semantic attributes, supporting more flexible music retrieval use cases.

## Design choices for audio-text joint embedding models


## Tips for training audio-text joint embedding models


## References

```{bibliography}
:filter: docname in docnames
```
2 changes: 1 addition & 1 deletion _static/scripts/bootstrap.js.map

Large diffs are not rendered by default.

3 changes: 0 additions & 3 deletions _static/scripts/fontawesome.js

This file was deleted.

1 change: 0 additions & 1 deletion _static/scripts/fontawesome.js.map

This file was deleted.

