Commit

Update documentation

Unknown committed Oct 18, 2024
1 parent a618cb5 commit 2685abe
Showing 11 changed files with 1,196 additions and 29 deletions.
Binary file added _images/history.png
38 changes: 38 additions & 0 deletions _sources/introduction/overview.md
@@ -0,0 +1,38 @@
# Background

```{figure} ../img/history.png
---
name: overview
---
```

The journey of Music and Language Models started with two basic human desires: to **understand music** deeply and to **listen to the music** we want, whenever we want, whether that is existing music by artists or newly created music. These fundamental needs have driven the development of technologies that connect music and language: language is the most fundamental communication channel we use, and it is through language that we aim to communicate with machines.

## Early Stage of Music Annotation and Retrieval

The first approach was Supervised Classification. This method involved developing models that predict appropriate natural language labels, drawn from a fixed vocabulary, for a given audio input. These labels can cover a wide range of musical attributes, including genre, mood, style, instruments, usage, theme, key, tempo, and more {cite}`sordo2007annotating`. The advantage of Supervised Classification was that it automated the annotation process. As music databases grew richer with these annotations, retrieval became easier: cascading filters over the predicted labels could be used to find the desired music {cite}`eck2007automatic` {cite}`lamere2008social`. Research on supervised classification evolved over time. In the early 2000s, with advances in pattern recognition methodology, the focus was primarily on feature engineering {cite}`fu2010survey`; as we entered the 2010s, with the rise of deep learning, the emphasis shifted towards model engineering {cite}`nam2018deep`.
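
To make the setup concrete, below is a minimal sketch of such a fixed-vocabulary tagger in PyTorch. The tag list, the log-mel input shape, and the tiny convolutional encoder are illustrative assumptions rather than any particular published model; the point is that the output layer is tied to a fixed set of labels and trained with a multi-label (binary cross-entropy) objective.

```python
import torch
import torch.nn as nn

# Hypothetical fixed vocabulary of tags (genre, mood, instrument, ...).
TAGS = ["rock", "jazz", "happy", "sad", "guitar", "piano"]

class MusicTagger(nn.Module):
    """Minimal fixed-vocabulary tagger: audio features in, tag logits out."""
    def __init__(self, n_tags: int = len(TAGS)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over frequency and time
            nn.Flatten(),
        )
        self.classifier = nn.Linear(32, n_tags)  # one output per fixed label

    def forward(self, mel_spec: torch.Tensor) -> torch.Tensor:
        # mel_spec: (batch, 1, n_mels, time) log-mel spectrogram
        return self.classifier(self.encoder(mel_spec))

model = MusicTagger()
logits = model(torch.randn(4, 1, 96, 256))  # dummy batch of spectrograms
targets = torch.randint(0, 2, (4, len(TAGS))).float()  # multi-hot tag annotations
loss = nn.BCEWithLogitsLoss()(logits, targets)
```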

```{note}
If you're particularly interested in this area, please refer to the following tutorials:
- [ISMIR2008-Tutorial: Social Tagging and Music Information Retrieval](https://www.slideshare.net/slideshow/social-tags-and-music-information-retrieval-part-i-presentation)
- [ISMIR2019-Tutorial: Waveform-based music processing with deep learning](https://zenodo.org/records/3529714)
- [ISMIR2021-Tutorial: Music Classification: Beyond Supervised Learning, Towards Real-world Applications](https://music-classification.github.io/tutorial/landing-page.html)
```

However, supervised classification has two fundamental limitations. First, it only supports music understanding and search through a fixed set of labels, so the model cannot handle vocabulary unseen during training. Second, the labels are represented with one-hot encoding, so the model cannot capture relationships between different labels. As a result, the trained model is tied to its specific supervision, which limits its ability to generalize and to understand the full range of musical language.
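
The second limitation follows directly from the label representation: one-hot vectors are mutually orthogonal, so semantically related tags are exactly as dissimilar as unrelated ones. A tiny illustration (the tag names are hypothetical):

```python
import numpy as np

tags = ["rock", "metal", "happy"]
one_hot = np.eye(len(tags))  # each tag is an orthogonal one-hot vector

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot[0], one_hot[1]))  # "rock" vs "metal" -> 0.0
print(cosine(one_hot[0], one_hot[2]))  # "rock" vs "happy" -> 0.0
```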

## Early Stage of Music Generation

Compared to Discriminative Models $p(y|x)$, which are relatively easy to model, Generative Models must capture the full data distribution, so early work focused on generating short single-instrument pieces or speech rather than complex multi-track music. In this early stage, unconditional generation $p(x)$ was studied with likelihood-based models, represented by WaveNet {cite}`van2016wavenet` and SampleRNN {cite}`mehri2016samplernn`, and with adversarial models, represented by WaveGAN {cite}`donahue2018adversarial`.
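
Concretely, the likelihood-based models factorize the distribution over waveform samples $x = (x_1, \dots, x_T)$ autoregressively; the conditional variants discussed next simply add a conditioning signal $c$ to every factor:

$$
p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}), \qquad
p(x \mid c) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}, c)
$$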

Early Conditioned Generation models $p(x|c)$ included the Universal Music Translation Network {cite}`mor2018universal`, which used a single shared encoder and different decoders for each instrument condition, and NSynth {cite}`engel2017neural`, which added pitch conditioning to WaveNet Autoencoders. These models represented some of the first attempts at controlled music generation.
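
As a rough, hypothetical illustration of what "adding a condition" means in this setting (this is not the actual NSynth or UMTN architecture), a conditional autoregressive model can inject a global pitch embedding into every timestep of the sample-level predictor:

```python
import torch
import torch.nn as nn

class ConditionalAutoregressiveNet(nn.Module):
    """Toy conditional autoregressive model: predicts the next quantized audio
    sample from past samples plus a global condition such as pitch."""
    def __init__(self, n_quant: int = 256, n_pitches: int = 128, hidden: int = 64):
        super().__init__()
        self.sample_emb = nn.Embedding(n_quant, hidden)
        self.pitch_emb = nn.Embedding(n_pitches, hidden)  # global condition c
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_quant)

    def forward(self, past_samples: torch.Tensor, pitch: torch.Tensor) -> torch.Tensor:
        # past_samples: (batch, time) integer-quantized audio; pitch: (batch,)
        h = self.sample_emb(past_samples) + self.pitch_emb(pitch)[:, None, :]
        y, _ = self.rnn(h)
        return self.out(y)  # logits over the next sample at every timestep

model = ConditionalAutoregressiveNet()
logits = model(torch.randint(0, 256, (2, 1000)), torch.tensor([60, 72]))  # MIDI-style pitches
```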

```{note}
If you're particularly interested in this area, please refer to the following tutorials:
- [ISMIR2019-Tutorial: Waveform-based music processing with deep learning, part 3](https://zenodo.org/records/3529714)
- [Generating Music in the waveform domain - Sander Dieleman](https://sander.ai/2020/03/24/audio-generation.html#fn:umtn)
```

However, Generative Models capable of natural language conditioning were not yet available at this stage. Although they still struggled to generate high-quality audio with long-term consistency, these early models laid the groundwork for future advances in music generation technology.
8 changes: 8 additions & 0 deletions _sources/introduction/scope.md
@@ -0,0 +1,8 @@
# Scope and Application

```{figure} ./img/scpoe.png
---
name: scope
---
Illustration of the development of music and language models.
```
59 changes: 38 additions & 21 deletions bibliography.html
@@ -32,7 +32,7 @@
<link rel="stylesheet" type="text/css" href="_static/styles/sphinx-book-theme.css?v=a3416100" />
<link rel="stylesheet" type="text/css" href="_static/togglebutton.css?v=13237357" />
<link rel="stylesheet" type="text/css" href="_static/copybutton.css?v=76b2166b" />
<link rel="stylesheet" type="text/css" href="_static/mystnb.4510f1fc1dee50b3e5859aac5469c37c29e427902b24a333a5f9fcb2f0b3ac41.css" />
<link rel="stylesheet" type="text/css" href="_static/mystnb.4510f1fc1dee50b3e5859aac5469c37c29e427902b24a333a5f9fcb2f0b3ac41.css?v=be8a1c11" />
<link rel="stylesheet" type="text/css" href="_static/sphinx-thebe.css?v=4fa983c6" />
<link rel="stylesheet" type="text/css" href="_static/sphinx-design.min.css?v=95c83b7e" />

@@ -174,13 +174,14 @@
<ul class="nav bd-sidenav bd-sidenav__home-link">
<li class="toctree-l1">
<a class="reference internal" href="intro.html">
Introduction
Connecting Music Audio and Natural Language
</a>
</li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Chapter 1. Introduction</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="introduction/intro.html">Introduction</a></li>
<li class="toctree-l1"><a class="reference internal" href="introduction/overview.html">Background</a></li>
<li class="toctree-l1"><a class="reference internal" href="introduction/scope.html">Scope and Application</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Chapter 2. Overview of Language Model</span></p>
<ul class="nav bd-sidenav">
@@ -409,29 +410,45 @@ <h1>Bibliography</h1>
<h1>Bibliography<a class="headerlink" href="#bibliography" title="Link to this heading">#</a></h1>
<div class="docutils container" id="id1">
<div role="list" class="citation-list">
<div class="citation" id="id24" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>DCLT18<span class="fn-bracket">]</span></span>
<p>Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: pre-training of deep bidirectional transformers for language understanding. <em>arXiv preprint arXiv:1810.04805</em>, 2018.</p>
<div class="citation" id="id4" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>DMP18<span class="fn-bracket">]</span></span>
<p>Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. <em>arXiv preprint arXiv:1802.04208</em>, 2018.</p>
</div>
<div class="citation" id="id22" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>RKH+21<span class="fn-bracket">]</span></span>
<p>Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and others. Learning transferable visual models from natural language supervision. In <em>International conference on machine learning</em>, 8748–8763. PMLR, 2021.</p>
<div class="citation" id="id10" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>ELBMG07<span class="fn-bracket">]</span></span>
<p>Douglas Eck, Paul Lamere, Thierry Bertin-Mahieux, and Stephen Green. Automatic generation of social tags for music recommendation. <em>Advances in neural information processing systems</em>, 2007.</p>
</div>
<div class="citation" id="id21" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>RKX+23<span class="fn-bracket">]</span></span>
<p>Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In <em>International Conference on Machine Learning</em>, 28492–28518. PMLR, 2023.</p>
<div class="citation" id="id2" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>ERR+17<span class="fn-bracket">]</span></span>
<p>Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. Neural audio synthesis of musical notes with wavenet autoencoders. In <em>International Conference on Machine Learning</em>. PMLR, 2017.</p>
</div>
<div class="citation" id="id25" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>RWC+19<span class="fn-bracket">]</span></span>
<p>Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and others. Language models are unsupervised multitask learners. <em>OpenAI blog</em>, 1(8):9, 2019.</p>
<div class="citation" id="id8" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>FLTZ10<span class="fn-bracket">]</span></span>
<p>Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang. A survey of audio-based music classification and annotation. <em>IEEE transactions on multimedia</em>, 2010.</p>
</div>
<div class="citation" id="id23" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>RSR+20<span class="fn-bracket">]</span></span>
<p>Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. <em>Journal of machine learning research</em>, 21(140):1–67, 2020.</p>
<div class="citation" id="id11" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>Lam08<span class="fn-bracket">]</span></span>
<p>Paul Lamere. Social tagging and music information retrieval. <em>Journal of new music research</em>, 2008.</p>
</div>
<div class="citation" id="id20" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>RPG+21<span class="fn-bracket">]</span></span>
<p>Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In <em>International conference on machine learning</em>, 8821–8831. Pmlr, 2021.</p>
<div class="citation" id="id5" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>MKG+16<span class="fn-bracket">]</span></span>
<p>Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. Samplernn: an unconditional end-to-end neural audio generation model. <em>arXiv preprint arXiv:1612.07837</em>, 2016.</p>
</div>
<div class="citation" id="id3" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>MWPT18<span class="fn-bracket">]</span></span>
<p>Noam Mor, Lior Wolf, Adam Polyak, and Yaniv Taigman. A universal music translation network. <em>arXiv preprint arXiv:1805.07848</em>, 2018.</p>
</div>
<div class="citation" id="id7" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>NCL+18<span class="fn-bracket">]</span></span>
<p>Juhan Nam, Keunwoo Choi, Jongpil Lee, Szu-Yu Chou, and Yi-Hsuan Yang. Deep learning for audio-based music classification and tagging: teaching computers to distinguish rock from bach. <em>IEEE signal processing magazine</em>, 2018.</p>
</div>
<div class="citation" id="id9" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>SLC07<span class="fn-bracket">]</span></span>
<p>Mohamed Sordo, Cyril Laurier, and Oscar Celma. Annotating music collections: how content-based similarity helps to propagate labels. In <em>ISMIR</em>, 531–534. 2007.</p>
</div>
<div class="citation" id="id6" role="doc-biblioentry">
<span class="label"><span class="fn-bracket">[</span>VDODZ+16<span class="fn-bracket">]</span></span>
<p>Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, and others. Wavenet: a generative model for raw audio. <em>arXiv preprint arXiv:1609.03499</em>, 2016.</p>
</div>
</div>
</div>
3 changes: 2 additions & 1 deletion genindex.html
@@ -182,7 +182,8 @@
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Chapter 1. Introduction</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="introduction/intro.html">Introduction</a></li>
<li class="toctree-l1"><a class="reference internal" href="introduction/overview.html">Background</a></li>
<li class="toctree-l1"><a class="reference internal" href="introduction/scope.html">Scope and Application</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Chapter 2. Overview of Language Model</span></p>
<ul class="nav bd-sidenav">