This repo contains a single Jupyter Notebook (Python) for the Deep Learning Course at the MSc Program of AI 2020-2022, organized by NCSR Demokritos and University of Piraeus.
Instructor: Mr. Theodoros Giannakopoulos - tygiannak
- matplotlib, plotly and seaborn for Visualization
- numpy for Matrices
- opencv for Computer Vision
- librosa for Music Analysis
- sklearn for Machine Learning
- keras for Deep Learning
- Approximately 10k Greek songs were collected, spanning from the Interwar Period (1920-1940) through the Greek Junta / Dictatorship (1967-1974) to today (2021)
- The collected songs came in various formats (.mp3, .wma, .wav) and their aggregate size exceeds 120 GB
- Initial Dataset: https://raw.githubusercontent.com/mzouros/dl_gmsa/main/Greek_Songs_List.txt
- Data cleanup:
  - Remove any unnecessary files (.txt, .jpg, .zip, etc.)
  - Delete songs with generic names (e.g. Track01)
  - Delete duplicate songs
  - Try to narrow down live performance songs
- Data matching:
  - Create a folder for each Singer OR Songwriter OR Lyricist OR Producer (according to Spotify's registries) and assign each song to a single specific folder (locally)
  - Create the same folders on Spotify, then check which of the songs in our DB exist as entries on Spotify. For each match, add the song to its corresponding folder
  - Match each song one by one to its Spotify counterpart. Rename each song of the dataset according to Spotify's track name registries
  - Deal with bad matches / misspellings
  - Validate Playlist names (Spotify) against Album names (locally)
  - Validate Playlist track names (Spotify) against Album track names (locally)
- Feature extraction via Spotify's API and export data to .csv
- Some of the features extracted were Danceability, Energy, Valence, Liveness and Loudness
- Song Features: https://github.com/mzouros/dl_gmsa/blob/main/Spotify_Tracks.csv
- Convert .mp3 and .wma files to .wav
- Resample all tracks to 8 kHz, monophonic (mono) sound
- Create the songs' sentiment classification logic, according to the Arousal-Valence Model
- Classify each song into a specific sentiment according to its values in the .csv file
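
The Arousal-Valence classification step above can be sketched as follows. This is a minimal illustration, not the notebook's actual logic: it assumes Spotify's `valence` and `energy` features (both in the range 0 to 1) stand in for valence and arousal, uses a hypothetical 0.5 threshold on each axis, and maps only to the four quadrant labels (Happy, Angry, Sad, Calm) rather than the 8 sentiments used in the project.

```python
def classify_sentiment(valence, energy, threshold=0.5):
    """Map a (valence, energy) pair to an Arousal-Valence quadrant label.

    Assumption: `energy` is used as a proxy for arousal and `threshold`
    splits each axis in half; both choices are illustrative only.
    """
    high_valence = valence >= threshold
    high_arousal = energy >= threshold
    if high_valence and high_arousal:
        return "Happy"
    if not high_valence and high_arousal:
        return "Angry"
    if not high_valence and not high_arousal:
        return "Sad"
    return "Calm"


# Hypothetical rows, shaped like (track name, valence, energy)
tracks = [
    ("song_a", 0.9, 0.8),
    ("song_b", 0.2, 0.9),
    ("song_c", 0.1, 0.2),
    ("song_d", 0.7, 0.3),
]
labels = {name: classify_sentiment(v, e) for name, v, e in tracks}
print(labels)
```

A finer split into 8 sentiments would need additional thresholds or angular sectors on the same plane; the choice of boundaries is a design decision that this sketch does not try to reproduce.
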
- Optical
  - Waveforms
  - Spectrograms
  - Mel Spectrograms & Laplacian Mel Spectrograms
  - Chromas
  - Tonnetz
  - CENS
- Features
  - MFCCs
  - STFTs
  - ZCRs
After experimenting with the initial dataset:
- Consider more data (via SpecAugmentation - Frequency Mask)
- Reconsider the total number of images per sentiment category (500 images per category, 4k images in total), so as to have a perfectly balanced set
- Reconsider image size from 128x128 to 256x256
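
The frequency-mask augmentation mentioned above can be sketched in plain NumPy. This is an illustrative version in the spirit of SpecAugment, not the notebook's implementation: the function name `frequency_mask`, the maximum mask width and the choice to fill the masked band with the spectrogram's minimum value are all assumptions.

```python
import numpy as np


def frequency_mask(spec, max_width=16, rng=None):
    """Return a copy of `spec` (n_freq_bins, n_frames) with one band of
    consecutive frequency bins masked out.

    The band width is drawn uniformly from [0, max_width] and its start
    position uniformly from the valid range, similar to SpecAugment's
    frequency masking. Masked bins are set to the minimum of `spec`
    (an assumption; zero or the mean are also common choices).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_bins = spec.shape[0]
    width = int(rng.integers(0, min(max_width, n_bins) + 1))
    start = int(rng.integers(0, n_bins - width + 1))
    masked = spec.copy()
    masked[start:start + width, :] = spec.min()
    return masked


# Hypothetical 128x128 log-Mel spectrogram filled with random values
rng = np.random.default_rng(seed=0)
spec = rng.normal(size=(128, 128))
augmented = frequency_mask(spec, max_width=16, rng=rng)

# Rows (frequency bins) that differ from the original are the masked band
changed_rows = np.flatnonzero((augmented != spec).any(axis=1))
print("masked bins:", changed_rows.size)
```

Applying the function several times per spectrogram, each time with a fresh random band, is one way to grow an under-represented sentiment category toward the target count.
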
- Many different architectures were tested, all yielding similar results (<0.4 validation accuracy)
- In all the architectures, the model seems to stop learning after a while
- Tried most overfitting-avoidance techniques (kernel regularization, batch normalization, dropout, early stopping)
- Tried different model sizes (layers, nodes), batch sizes, kernel & stride sizes, dropout values and numbers of epochs
- Tried a perfectly balanced set of 500 samples per sentiment (4k samples in total), after data augmentation
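
The overfitting-avoidance techniques listed above (kernel regularization, batch normalization, dropout, early stopping) can be combined in Keras roughly as follows. The layer sizes, L2 factor, dropout rate, patience, single-channel 128x128 input and 8-class softmax output are assumptions chosen for illustration; they are not the architectures that were actually tested.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers


def build_cnn(input_shape=(128, 128, 1), num_classes=8, l2=1e-4, dropout=0.3):
    """Small CNN using L2 kernel regularization, batch norm and dropout."""
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    for filters in (16, 32, 64):
        model.add(layers.Conv2D(filters, 3, padding="same",
                                kernel_regularizer=regularizers.l2(l2)))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation("relu"))
        model.add(layers.MaxPooling2D(pool_size=2))
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dropout(dropout))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Early stopping: halt when validation loss stops improving and
# restore the weights from the best epoch.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model = build_cnn()
model.summary()
```

Training would then pass the callback to `fit`, e.g. `model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50, batch_size=32, callbacks=[early_stop])`, where the array names and the epoch and batch values are placeholders.
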
- Problems during implementation:
  - One-by-one matching is very time consuming
  - Lots of duplicates required extra work (e.g. Stavros Xarxakos and Stavros Ksarhakos, both indicating the same Artist)
  - Spotify classifies singers, songwriters, lyricists and producers all as Artists. All of those need to be taken into consideration when checking whether a specific song in our dataset exists as an entry on Spotify
  - Annotating an audio recording is challenging. How many emotions should we define to recognize?
  - Emotions are subjective and people interpret them differently. It is hard to define the notion of emotions.
  - Audio analysis through image classification tends to have bad accuracy results (<0.5) (https://towardsdatascience.com/whats-wrong-with-spectrograms-and-cnns-for-audio-processing-311377d7ccd)
- Food for thought:
  - How Spotify quantifies its tracks' variables is not clear, but usually in fuzzy logic a team of specialists defines the pertinence degree between qualitative aspects
  - A happy song may have sad lyrics and vice versa. A perfect song sentiment analysis would require both text and sound analysis and classification.
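
The duplicate-artist problem noted above (different transliterations of the same name) can be partly automated with fuzzy string matching. The sketch below uses Python's standard `difflib`; the helper name `find_likely_duplicates` and the 0.8 similarity threshold are assumptions, and any pair it flags would still need manual confirmation.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive similarity ratio between two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_likely_duplicates(names, threshold=0.8):
    """Return pairs of names whose similarity is at least `threshold`."""
    return [(a, b) for a, b in combinations(names, 2)
            if similarity(a, b) >= threshold]


# The first two names come from the example above; the third is an
# arbitrary extra entry added to show a non-matching case.
artists = ["Stavros Xarxakos", "Stavros Ksarhakos", "Mikis Theodorakis"]
for a, b in find_likely_duplicates(artists):
    print(f"possible duplicate: {a!r} / {b!r} ({similarity(a, b):.2f})")
```

A lower threshold catches more transliteration variants at the cost of more false positives, so the value is worth tuning on a sample of known duplicates.
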
Experiment with:
- Bigger Dataset (10k+ images)
- Bigger image size (maybe 512x512)
- Pre-trained Classifiers
- Classification in 4 sentiments (Happy, Calm, Sad, Angry) instead of 8
- Different NN architectures (e.g. CNN-RNN in parallel)
- Instead of feature extraction through Spotify's API, use an annotation app, so that volunteers can annotate according to their own judgment (e.g. AnnoEmo app)
- Michael Zouros - mzouros
This project is licensed under the MIT License - see the LICENSE.md file for details
- Theodoros Giannakopoulos - tygiannak