Commit 0a8f95a (parent 41b11c5), committed by Unknown on Nov 10, 2024. This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.

Showing 79 changed files with 16,105 additions and 2,216 deletions.
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: db8c3136e21a7390d113749b043e83a3
+config: 79bb7ee00347aa10773981b437044cde
 tags: 645f666f9bcd5a90fca523b33c5a78b7
@@ -1 +1,20 @@
# Conclusion

Congratulations! You finished the book, executed every piece of code we typed, and read every line we wrote!

In the first chapter, The Basics, we defined music classification and introduced its applications. We then looked into input representations with a special focus on biological plausibility. We also looked into music classification datasets, with a special focus on how to use some popular datasets correctly. In the evaluation section, we introduced important metrics such as precision and recall, along with code demos to compute them. After finishing this chapter, we hope you're ready to start working on your own music classification model.

In the second chapter, Supervised Learning, we reviewed popular architectures: their definitions, pros, and cons. We also demonstrated data augmentation methods for music audio, including the code, spectrograms, and audio signals you can play. At the end of the chapter, we showed a full example of data preparation, model training, and evaluation in PyTorch. After this chapter, you can implement the majority of music classification models introduced during the deep learning era.

In the third chapter, Semi-Supervised Learning, we covered transfer learning and semi-supervised learning, approaches that have recently become popular due to annotation costs. Both are strategies to consider when only a small number of labeled items is available. These approaches can be useful in many real-world situations where you have, for example, fewer than a thousand labeled items.

In the fourth chapter, Self-Supervised Learning, we presented an even more radical approach. The goal of self-supervised learning is to learn useful representations without any labels. To achieve this, researchers assume some structural or internal patterns purely within the input and design loss functions to predict those patterns. We covered a wide range of self-supervised learning methods introduced in music, speech, and computer vision. The lessons of this chapter free you from worrying about obtaining annotations.

In the fifth chapter, Towards Real-world Applications, we introduced what people care about in industry. After finishing this chapter, you can understand the procedures and tasks that researchers and engineers in industry spend their time on.

We're delighted that you have studied music classification with us. Did you achieve your goal while reading it? Are your questions answered now? We hope we also achieved our goals: lowering the barrier to music classification for newcomers, providing methods to cope with data issues, and narrowing the gap between academia and industry. Please feel free to reach out to us if you have any questions or feedback.

Best wishes,

Minz, Janne, and Keunwoo.
@@ -6,7 +6,7 @@
    "id": "AU2aKMwT21Oq"
   },
   "source": [
-   "# Code Tutoiral"
+   "# Code Tutorial"
   ]
  },
  {
@@ -0,0 +1,7 @@
# Evaluation

## Overview
@@ -1 +1,63 @@
# Audio-Text Joint Embedding

## Classification to Joint Embedding

Following the classification framework, the audio-text joint embedding methodology emerged as a way to handle more flexible user queries. Audio-text joint embedding, as a multimodal deep metric learning model, enables music search beyond fixed vocabularies by leveraging language embeddings from pretrained language models. In this approach, we project audio content and its associated text into a shared space where similarity can be computed with a dot product.

## Model Architecture

```{figure} ./img/cls_to_je.png
---
name: classification to joint embedding
---
```

At a high level, a joint embedding model is trained with paired text and music audio samples, learning to map related pairs close together in the embedding space while pushing unrelated samples further apart.

Let $x_{a}$ represent a musical audio sample and $x_{t}$ denote its paired text description. The functions $f(\cdot)$ and $g(\cdot)$ represent the audio and text encoders, respectively. The output feature embeddings from each encoder are mapped to a shared co-embedding space through projection layers. During training, the model typically employs either a triplet loss based on a hinge margin or a contrastive loss based on cross entropy to learn these mappings.
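
To make this setup concrete, here is a minimal PyTorch sketch of a dual-encoder joint embedding model. The encoder backbones, embedding sizes, and projection dimension are placeholder assumptions for illustration, not the architecture of any particular paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingModel(nn.Module):
    """Dual-encoder model: f(audio) and g(text) projected into a shared space."""

    def __init__(self, audio_encoder, text_encoder,
                 audio_dim=512, text_dim=768, joint_dim=128):
        super().__init__()
        self.audio_encoder = audio_encoder  # any module mapping waveform/spectrogram -> (B, audio_dim)
        self.text_encoder = text_encoder    # any module mapping tokenized text -> (B, text_dim)
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, audio, text):
        # f(x_a): projected and L2-normalized audio embedding
        z_a = F.normalize(self.audio_proj(self.audio_encoder(audio)), dim=-1)
        # g(x_t): projected and L2-normalized text embedding
        z_t = F.normalize(self.text_proj(self.text_encoder(text)), dim=-1)
        return z_a, z_t  # dot products between rows give audio-text similarities
```

With normalized embeddings, the dot product coincides with cosine similarity, which is the quantity the loss functions below operate on.
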
## Loss Functions

The most common metric learning loss functions used to train joint embedding models are the triplet loss and the contrastive loss.

```{figure} ./img/loss_functions.png
---
name: loss functions
---
```

The goal of triplet-loss models is to learn an embedding space in which relevant input pairs are mapped closer together than irrelevant pairs. The objective function is formulated as follows:

$$
\mathcal{L}_{\text{triplet}} = \max(0, - f(x_{a}) \cdot g(x_{t}^{+}) + f(x_{a}) \cdot g(x_{t}^{-}) + \delta)
$$

where $\delta$ is the margin, $f(x_{a})$ is the audio embedding, $g(x_{t}^{+})$ is the paired (positive) text embedding for the music audio, and $g(x_{t}^{-})$ is an irrelevant (negative) text embedding.
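
As a concrete reference, below is a minimal PyTorch sketch of this hinge-based triplet loss, written for batches of already-normalized audio and text embeddings. The margin value and batch shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(z_audio, z_text_pos, z_text_neg, margin=0.2):
    """Hinge-based triplet loss with dot-product similarity.

    z_audio:    (B, D) audio embeddings f(x_a)
    z_text_pos: (B, D) paired text embeddings g(x_t^+)
    z_text_neg: (B, D) irrelevant text embeddings g(x_t^-)
    """
    sim_pos = (z_audio * z_text_pos).sum(dim=-1)  # f(x_a) . g(x_t^+)
    sim_neg = (z_audio * z_text_neg).sum(dim=-1)  # f(x_a) . g(x_t^-)
    # max(0, delta - positive similarity + negative similarity), averaged over the batch
    return F.relu(margin - sim_pos + sim_neg).mean()
```
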
The core idea of contrastive-loss models is to reduce the distance between positive sample pairs while increasing the distance between negative sample pairs. Unlike triplet-loss models, contrastive-loss models can utilize the large number of negative samples that exist in a mini-batch of size $N$. During training, the audio and text encoders are jointly trained to maximize the similarity between the $N$ positive (music, text) pairs while minimizing the similarity for the $N \times (N-1)$ negative pairs. This is known as the multi-modal version of the InfoNCE loss {cite}`oord2018representation`, {cite}`radford2021learning` and is formulated as follows:

$$
\mathcal{L}_\text{Contrastive} = - \frac{1}{N} \sum_{i=1}^N \log \frac{\exp(f(x_{a_i}) \cdot g(x_{t_i}^{+}) / \tau)}{\sum_{j=1}^N \exp(f(x_{a_i}) \cdot g(x_{t_j}) / \tau)}
$$

where $\tau$ is a learnable temperature parameter.
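
The following is a minimal sketch of this contrastive (InfoNCE-style) loss in PyTorch, assuming a batch of $N$ normalized audio embeddings aligned row-by-row with $N$ normalized text embeddings. The sketch shows only the audio-to-text direction from the formula above; whether to symmetrize over the text axis as well (as CLIP does) is a separate design choice.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_audio, z_text, log_temperature):
    """Audio-to-text InfoNCE loss over a mini-batch.

    z_audio, z_text: (N, D) normalized embeddings; row i of each forms a positive pair.
    log_temperature: learnable scalar parameter; tau = exp(log_temperature).
    """
    tau = log_temperature.exp()
    logits = z_audio @ z_text.t() / tau                   # (N, N) pairwise similarities
    targets = torch.arange(z_audio.size(0), device=z_audio.device)
    # cross entropy over each row reproduces the -log softmax term in the formula
    return F.cross_entropy(logits, targets)
```
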
## What is the Benefit of Joint Embedding?

```{figure} ./img/joint_embedding_benefit.png
---
name: joint embedding benefit
---
```

The key advantage of joint embedding is that we can leverage the embedding space of pretrained language models as supervision, rather than being limited to a fixed vocabulary. Since pretrained language models are trained on vast text corpora from the internet, they effectively encode the semantic relationships between words and phrases. In music retrieval scenarios, this allows us to handle zero-shot user queries efficiently by utilizing these rich language representations.
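
To illustrate what zero-shot retrieval looks like once such a model is trained, here is a small, hypothetical usage sketch. The function name, the precomputed audio index, and the tokenization step are assumptions for illustration, not part of any specific library.

```python
import torch
import torch.nn.functional as F

def retrieve(model, text_tokens, catalog_audio_embeddings, top_k=5):
    """Rank catalog tracks against a free-form text query by dot-product similarity.

    model: a trained dual-encoder like the sketch above; only its text branch is used here.
    text_tokens: the tokenized query, in whatever form the text encoder expects.
    catalog_audio_embeddings: (M, D) precomputed, normalized audio embeddings.
    """
    with torch.no_grad():
        z_query = F.normalize(model.text_proj(model.text_encoder(text_tokens)), dim=-1)
    scores = catalog_audio_embeddings @ z_query.squeeze(0)  # (M,) similarity per track
    return scores.topk(top_k).indices                       # indices of the best-matching tracks
```
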
Additionally, by using language model encoders, we can address the out-of-vocabulary problem through subword tokenization techniques such as byte-pair encoding (BPE) or SentencePiece. These methods break unknown words down into smaller subword units that exist in the model's vocabulary, enabling the system to handle novel queries.
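
As a quick illustration, the snippet below uses a pretrained subword tokenizer from the Hugging Face `transformers` library to split a query containing rare words into known subword pieces. The example query and the choice of the `roberta-base` tokenizer are arbitrary; any BPE- or SentencePiece-based tokenizer behaves similarly.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # byte-level BPE tokenizer

# A query containing words unlikely to appear in any fixed tag vocabulary.
query = "dreamy chillhop with lo-fi crackle"
print(tokenizer.tokenize(query))
# Rare words are decomposed into smaller subword pieces from the BPE vocabulary,
# so the text encoder can still produce a usable representation for them.
```
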
This combination of pretrained language model semantics and subword tokenization provides two key benefits:

1. Flexible handling of open-vocabulary queries through language model representations
2. Robust processing of out-of-vocabulary words through subword tokenization

## References

```{bibliography}
:filter: docname in docnames
```
@@ -0,0 +1,54 @@
# Models

In this chapter, we review recent advances in audio-text joint embedding models and discuss useful design choices and tips for training them.

## Audio-Tag Joint Embedding

```{figure} ./img/choi_zeroshot.png
---
name: Audio-Tag Joint Embedding
---
```

An early audio-text joint embedding work introduced to the ISMIR community was {cite}`choi2019zero`, which emphasized the effectiveness of pretrained word embeddings (GloVe) in zero-shot music annotation and retrieval scenarios. Subsequently, {cite}`won2021multimodal` extended this idea beyond audio alone by including collaborative filtering embeddings, covering both the acoustic and the cultural aspects of music. {cite}`won2021multimodal`, {cite}`doh2024musical` addressed the fact that general-purpose word embeddings are not music-domain specific by training audio-text joint embeddings with music domain-specific word embeddings.

However, these models faced limitations in handling multiple-attribute queries or complex sentence-level queries due to their reliance on word embeddings. This is because word embeddings are static: they do not encode different meanings based on the surrounding context tokens. As a result, research using these models was constrained to tag-level retrieval scenarios.

## Audio-Multi-Tag Joint Embedding

To better handle multiple-attribute semantic queries, researchers shifted their focus from word embeddings to bi-directional transformer encoders {cite}`devlin2018bert` {cite}`liu2019roberta`. They aimed to leverage **Contextualized Word Representations**, which can encode different meanings of multiple attributes or sentences based on co-occurring words. {cite}`chen2022learning` and {cite}`doh2023toward` evaluated the language model's ability to understand multiple-attribute queries by utilizing existing multi-label tagging datasets.
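
To show the kind of contextualized text representation these models rely on, here is a small sketch that embeds a multi-attribute query with a BERT-style encoder from the Hugging Face `transformers` library. Mean pooling over token states is one common, illustrative choice of sentence representation, not the exact recipe of the cited papers.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

query = "sad piano music for a rainy night"  # multi-attribute query: mood + instrument + context
inputs = tokenizer(query, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, T, 768) contextualized token states
sentence_embedding = hidden.mean(dim=1)           # (1, 768) pooled query representation
```
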
## Audio-Sentence Joint Embedding

```{figure} ./img/clap_mulan.png
---
name: Audio-Sentence Joint Embedding
---
```

To handle flexible natural language queries, researchers turned to noisy audio-text datasets {cite}`huang2022mulan` and human-generated natural language annotations that go beyond traditional annotation datasets {cite}`manco2022contrastive`. Thanks to sufficient dataset scaling and a contrastive loss that benefits from large batch sizes, they built joint embedding models with stronger audio-text associations than previous studies. {cite}`manco2022contrastive`, {cite}`huang2022mulan`, and {cite}`wu2023large` demonstrated that contrastive learning with large-scale audio-text pairs can effectively learn semantic relationships between music and natural language descriptions.

## Beyond Semantic Attributes: Toward Handling Similarity Queries

```{figure} ./img/doh_enrich.png
---
name: Similarity Queries
---
```

Recent work has explored expanding joint embedding models beyond semantic attribute queries. While existing datasets focus on genre, mood, instrument, style, and theme attributes, {cite}`doh2024musical` proposed training joint embedding models that can handle similarity-based queries by leveraging diverse metadata and music knowledge graphs. This enables the model to understand relationships between songs based on metadata similarity rather than just semantic attributes, supporting more flexible music retrieval use cases.

## Design choices for audio-text joint embedding models

## Tips for training audio-text joint embedding models

## References

```{bibliography}
:filter: docname in docnames
```