# Background

```{figure} ../img/history.png
---
name: overview
---
```

The journey of Music and Language Models started with two basic human desires: to **understand music** deeply and to **listen to the music** we want, whenever we want, whether that is existing music by artists or newly created music. These fundamental needs have driven the development of technologies that connect music and language: language is our most fundamental communication channel, and it is through language that we aim to communicate with machines.

## Early Stage of Music Annotation and Retrieval

The first approach was supervised classification. This method develops models that predict appropriate natural language labels, drawn from a fixed vocabulary, for a given audio input. These labels can cover a wide range of musical attributes, including genre, mood, style, instrument, usage, theme, key, and tempo {cite}`sordo2007annotating`. The advantage of supervised classification is that it automates the annotation process: as music databases grew richer with these annotations, the retrieval phase could use cascading filters to find the desired music more easily {cite}`eck2007automatic` {cite}`lamere2008social`. Research on supervised classification evolved over time. In the early 2000s, with advances in pattern recognition methodologies, the focus was primarily on feature engineering {cite}`fu2010survey`. Entering the 2010s, with the rise of deep learning, the emphasis shifted towards model engineering {cite}`nam2018deep`.

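To make the fixed-vocabulary setup concrete, here is a minimal sketch of a supervised music tagger that maps a log-mel-spectrogram to one logit per predefined label. The architecture, label set, and input shape are illustrative assumptions, not a reconstruction of any cited system.

```python
# A minimal sketch of fixed-vocabulary music tagging (illustrative only).
# Assumptions: log-mel-spectrogram input of shape (batch, 1, n_mels, n_frames)
# and a hypothetical 4-tag vocabulary.
import torch
import torch.nn as nn

LABELS = ["rock", "jazz", "happy", "piano"]  # the fixed vocabulary

class MusicTagger(nn.Module):
    def __init__(self, n_labels: int = len(LABELS)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, n_labels)  # one logit per fixed label

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(mel))

model = MusicTagger()
logits = model(torch.randn(2, 1, 96, 128))  # two fake spectrograms
probs = torch.sigmoid(logits)               # multi-label tag probabilities
```

Training such a model with a binary cross-entropy loss over the label set yields the automatic annotations that retrieval systems then filter over.
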
```{note}
If you're particularly interested in this area, please refer to the following tutorials:
- [ISMIR2008-Tutorial: SOCIAL TAGGING AND MUSIC INFORMATION RETRIEVAL](https://www.slideshare.net/slideshow/social-tags-and-music-information-retrieval-part-i-presentation)
- [ISMIR2019-Tutorial: Waveform-based music processing with deep learning](https://zenodo.org/records/3529714)
- [ISMIR2021-Tutorial: Music Classification: Beyond Supervised Learning, Towards Real-world Applications](https://music-classification.github.io/tutorial/landing-page.html)
```

However, supervised classification has two fundamental limitations. First, it only supports music understanding and search over a fixed set of labels, so the model cannot handle unseen vocabulary. Second, labels are represented with one-hot encoding, so the model cannot capture relationships between different labels. As a result, the trained model is tied to the given supervision, limiting its ability to generalize and to understand a wide range of musical language.

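The second limitation is easy to verify numerically: distinct one-hot vectors are always orthogonal, so the label representation assigns zero similarity even to closely related tags. A small illustration with a hypothetical vocabulary:

```python
# One-hot label vectors are mutually orthogonal, so the encoding carries
# no notion of how related two tags are (hypothetical vocabulary).
import numpy as np

vocab = ["happy", "joyful", "sad"]
one_hot = np.eye(len(vocab))

happy, joyful = one_hot[0], one_hot[1]
cosine = happy @ joyful / (np.linalg.norm(happy) * np.linalg.norm(joyful))
print(cosine)  # 0.0 -- "happy" is as dissimilar to "joyful" as to "sad"
```
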
## Early Stage of Music Generation

Compared to discriminative models, which estimate $p(y|x)$ and are relatively easy to model, generative models must capture the data distribution itself. Early work therefore focused on short single-instrument pieces or speech datasets rather than complex multi-track music. In this early stage, unconditional generation $p(x)$ was studied with likelihood-based models (represented by WaveNet {cite}`van2016wavenet` and SampleRNN {cite}`mehri2016samplernn`) and adversarial models (represented by WaveGAN {cite}`donahue2018adversarial`).

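Concretely, likelihood-based models such as WaveNet and SampleRNN factorize the joint distribution of a waveform $x = (x_1, \ldots, x_T)$ autoregressively, predicting each audio sample from all preceding samples:

$$
p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})
$$

Adversarial models such as WaveGAN instead train a generator to map noise to waveforms that a discriminator cannot distinguish from real audio, trading exact likelihoods for fast, parallel sampling.
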
Early conditional generation models $p(x|c)$ included the Universal Music Translation Network {cite}`mor2018universal`, which used a single shared encoder with a separate decoder for each instrument condition, and NSynth {cite}`engel2017neural`, which added pitch conditioning to a WaveNet autoencoder. These models represented some of the first attempts at controlled music generation.

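As a rough sketch of what such conditioning looks like, the toy decoder below adds a learned pitch embedding to its per-timestep features, loosely in the spirit of NSynth's pitch conditioning; module names and sizes are illustrative assumptions, not the published architecture.

```python
# A toy sketch of condition injection: a learned pitch embedding is
# broadcast over time and added to the decoder features. Illustrative
# only; not the actual NSynth/WaveNet autoencoder architecture.
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    def __init__(self, n_pitches: int = 128, channels: int = 64):
        super().__init__()
        self.pitch_emb = nn.Embedding(n_pitches, channels)
        self.net = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, latents: torch.Tensor, pitch: torch.Tensor) -> torch.Tensor:
        # latents: (batch, channels, time); pitch: (batch,) MIDI note numbers
        cond = self.pitch_emb(pitch).unsqueeze(-1)  # (batch, channels, 1)
        return self.net(latents + cond)             # condition every timestep

decoder = ConditionedDecoder()
out = decoder(torch.randn(2, 64, 100), torch.tensor([60, 67]))
```
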
```{note}
If you're particularly interested in this area, please refer to the following resources:
- [ISMIR2019-Tutorial: Waveform-based music processing with deep learning, part 3](https://zenodo.org/records/3529714)
- [Generating Music in the waveform domain - Sander Dieleman](https://sander.ai/2020/03/24/audio-generation.html#fn:umtn)
```

However, generative models capable of natural language conditioning were not yet available at this stage. Despite the challenge of generating high-quality audio with long-term consistency, these early models laid the groundwork for future advances in music generation technology.

# Scope and Application

```{figure} ./img/scpoe.png
---
name: scope
---
Illustration of the development of music and language models.
```