Big Data Programming Project -- University Junior Year
Being a music fan myself, I am constantly discovering new music and compiling playlists on Spotify to match different moods and tastes. This made me curious to explore how different aspects of a song, such as its genre, key signature, and lyrics, affect its popularity. I also wanted to explore the textual content of the lyrics and discover the overall sentiment of present-day songs.
Track data and lyrics were gathered with the help of the spotipy and lyricsgenius libraries, which facilitate pulling data from the Spotify and Genius APIs. To begin, the names of the artists whose songs were to be considered were scraped from the top 1000 streaming artists on Spotify using Selenium WebDriver. The artist data was then pulled using spotipy and exported to ArtistDetails.csv. Next, the track data was pulled using spotipy, and the lyrics and some related data were retrieved using lyricsgenius. Additionally, the Genius API was queried directly for song IDs to keep song retrieval accurate. Multithreading was adopted at this stage to make the data retrieval process faster and more efficient. In total, 159,231 records were retrieved with 30 unique columns, covering 700 different artists. The data can be found in All_Songs.csv.
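A minimal sketch of this retrieval pipeline is shown below. The credentials, artist names, and row layout are placeholders rather than the project's actual values, and error handling and rate limiting are omitted for brevity:

```python
import concurrent.futures

import lyricsgenius
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Placeholder credentials; both APIs require a registered application.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="SPOTIFY_CLIENT_ID", client_secret="SPOTIFY_CLIENT_SECRET"))
genius = lyricsgenius.Genius("GENIUS_ACCESS_TOKEN", verbose=False)

def fetch_artist_tracks(artist_name):
    """Pull an artist's top tracks from Spotify and attach Genius lyrics."""
    hit = sp.search(q=f"artist:{artist_name}", type="artist", limit=1)
    artist = hit["artists"]["items"][0]
    rows = []
    for track in sp.artist_top_tracks(artist["id"])["tracks"]:
        song = genius.search_song(track["name"], artist_name)
        rows.append({
            "track_uri": track["uri"],
            "track_name": track["name"],
            "track_popularity": track["popularity"],
            "track_lyrics": song.lyrics if song else None,
        })
    return rows

# Multithreading: each worker thread handles one artist.
artists = ["Artist A", "Artist B"]  # stand-in for the scraped top-artists list
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    all_rows = [row for rows in pool.map(fetch_artist_tracks, artists) for row in rows]
```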
After looking through the data collected from the Spotify and Genius APIs, it was noticed that the Genius API sometimes returned non-lyric page text instead of song lyrics, which needed to be removed. Additionally, the Spotify API did not return any genres for the songs, which made the genres column insignificant to the analysis. There were also several duplicate track URIs, meaning that some tracks were retrieved from the Spotify API more than once, probably because multiple artists sang on the same track and it was therefore scraped once per artist. All of these errors are handled in this phase.
The following steps were followed (a minimal pandas sketch appears after this list):
- Handling duplicates for track URIs -> removed
- Renaming columns for consistency and descriptiveness
- Re-organizing columns in a more logical order
- Re-formatting data types for several columns for consistency
- Handling errors:
  - Spotify did not return any genres -> feature engineering
  - Genius returned the wrong lyrics for several songs
- Cleaning up the lyrics by removing line breaks for storage on the Hive server
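A minimal pandas sketch of these cleanup steps, using the column names from the table below (the non-lyrics filter is a hypothetical heuristic, not the project's exact rule):

```python
import pandas as pd

df = pd.read_csv("All_Songs.csv")

# Drop duplicate tracks by their Spotify URI, keeping the first occurrence.
df = df.drop_duplicates(subset="track_uri")

# Hypothetical heuristic: drop rows whose "lyrics" are missing or suspiciously
# long, which is typical of non-lyric pages (tracklists, liner notes) on Genius.
df = df[df["track_lyrics"].notna() & (df["track_lyrics"].str.len() < 15000)]

# Flatten line breaks so each lyric fits on a single line for the Hive server.
df["track_lyrics"] = df["track_lyrics"].str.replace(r"[\r\n]+", " ", regex=True)

df.to_csv("All_Songs_Cleaned.csv", index=False)
```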
Column Name | Description |
---|---|
track_uri | The Spotify Unique Identifier (ID) for each track. |
track_name | Name of Track including Featured Artists. |
cleaned_track_name | Name of the Track without Featured Artists. |
track_artists | Name of all Artists on the Track including Featured Artists. |
featured_artists | Names of Featured Artists on the Track excluding main artist. |
track_is_explicit | Boolean indicating whether the song contains explicit content. |
track_popularity | The Popularity of the Track as calculated by the Spotify Algorithm using the recent number of plays. |
track_genres | The genres of the track as derived from the Artists' Genres performing on the track. |
track_duration_ms | The track length in milliseconds. |
track_time_signature | The time signature of the song, which specifies how many beats are in each bar; ranges from 3 to 7, indicating time signatures of "3/4" to "7/4". |
track_acousticness | Confidence measure indicating whether a track is acoustic. |
track_danceability | Indicates how suitable a track is for dancing based on several musical elements. |
track_energy | Indicates the intensity and activity of a song based on several music elements. Typically high energy is characterized by being fast, loud, and noisy. |
track_key_signature | The key signature of the song, calculated from the key and mode values (see the sketch after this table). |
track_instrumentalness | Predicts whether a track has no vocals. "Ooh" and "aah" sounds are treated as instrumental. Values above 0.5 are intended to represent instrumental tracks. |
track_key | The key the track is in, represented using integers mapping to pitch class notation. |
track_mode | Indicates 1 for a major scale and 0 for a minor scale. |
track_liveness | Detects the presence of an audience in the recording. Values above 0.8 indicate a strong likelihood that the track is live. |
track_loudness | The overall loudness of a track in decibels (dB). Ranges between -60 and 0 dB. |
track_speechiness | Detects the presence of spoken words in the track. Values above 0.66 indicate a track made mostly of spoken words; values from 0.33 to 0.66 indicate tracks containing both music and speech, either in sections or layered; values below 0.33 most likely represent music and other non-speech tracks. |
track_tempo | The beats per minute (BPM) of the song, i.e., the speed and pace of the track. |
track_valence | Describes the track's musical positiveness; whether or not the track sounds positive. |
track_lyrics | The Track Lyrics. |
lyrics_page_views | The number of views for the lyrics on Genius. |
track_number | The number of the track on the album. |
album_name | Name of Album where the Track Appears. |
album_artist | The Main Artist of the Album where the Track Appears. |
album_release_date | The date the album containing the track was released. |
album_popularity | The Popularity of the Album as calculated by the Spotify Algorithm using the popularity of the songs. |
album_record_label | The Record Label under which the Album was released. |
album_cover | The URL of the album cover. |
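Since Spotify exposes only the integer key and mode fields, the key signature column was derived from them. Below is a minimal sketch of one plausible mapping; the project's exact labels may differ:

```python
# Pitch class notation: integers 0-11 map to the twelve pitches.
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

def key_signature(key: int, mode: int):
    """Combine Spotify's key (0-11, -1 if undetected) and mode (1=major, 0=minor)."""
    if key == -1:
        return None
    return f"{PITCH_CLASSES[key]} {'Major' if mode == 1 else 'Minor'}"
```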
The modified CSV file is uploaded to the Hadoop Hive server through Docker in order to perform MapReduce on the data. This enables parallel data processing, which brings benefits such as speed and efficiency. Details about the setup can be found in Docker-Setup.md.
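For illustration, a table can be created and the CSV loaded from Python through HiveServer2. The sketch below assumes the pyhive client (not part of the project's library list) and a Hive container reachable on localhost:10000, with only a few of the 30 columns shown:

```python
from pyhive import hive  # assumption: HiveServer2 exposed by the Docker setup

conn = hive.Connection(host="localhost", port=10000, username="hive")
cur = conn.cursor()

# Abbreviated DDL; the real table would declare all 30 columns.
cur.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        track_uri STRING,
        track_name STRING,
        track_popularity INT,
        track_lyrics STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")

# Load the cleaned CSV from the container's local filesystem into the table.
cur.execute("LOAD DATA LOCAL INPATH '/data/All_Songs_Cleaned.csv' INTO TABLE songs")
```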
With the aim of gaining more insight into the data collected, I analyzed some track features with the help of graphs and comparisons with other songs (a small sketch follows this list). Some of those include:
- determining the top 15 artists, who made up almost 19% of all songs collected
- determining the top genres: pop and rock made up 51% of all data collected
- determining the top years of release: 2020 and 2021 were the top years, around the time of the pandemic and when TikTok gained more popularity
- comparing audio features pre-2000s and post-2000s: post-2000s music was found to have higher energy, which indicates higher intensity
- comparing audio features of top songs with all other songs: it was found that top songs have higher average danceability than all songs
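As an example, the pre-2000s versus post-2000s comparison can be sketched as follows, assuming the cleaned CSV and the column names from the table above:

```python
import pandas as pd

df = pd.read_csv("All_Songs_Cleaned.csv")
df["release_year"] = pd.to_datetime(df["album_release_date"], errors="coerce").dt.year

# Compare mean audio features for tracks released before and after 2000.
features = ["track_energy", "track_danceability", "track_valence"]
pre = df.loc[df["release_year"] < 2000, features].mean()
post = df.loc[df["release_year"] >= 2000, features].mean()
print(pd.DataFrame({"pre_2000": pre, "post_2000": post}))
```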
For the purpose of the project, this phase is necessary in order to gain insight into the track lyrics. To begin, the data was analyzed to determine a starting point. It became apparent that the lyrics spanned several languages, such as Spanish and German. After performing some research on multilingual sentiment analysis, it was concluded that the sentiment analysis process is largely the same for other languages as it is for English. However, some pre-processing to classify the languages is required in order to specify the language of choice to the text-processing library.
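Language classification was done with langid (listed among the libraries below); a minimal sketch, assuming the lyrics live in a track_lyrics column:

```python
import langid
import pandas as pd

df = pd.read_csv("All_Songs_Cleaned.csv")

# langid.classify returns an (ISO 639-1 code, score) pair per text.
df["lyrics_language"] = df["track_lyrics"].fillna("").apply(
    lambda text: langid.classify(text)[0] if text else None)

print(df["lyrics_language"].value_counts().head())
```

The swifter library in the stack can wrap the apply call to parallelize it over the 159k rows.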
It was determined that analyzing each language separately would result in the least loss of data, only 2.6%. Data frames for each language were thus created, and it was time for preprocessing. The steps followed (sketched in the snippet after this list) included:
- Lowercasing
- Removing the first line from the lyrics, since the Genius API returns the name of the song as part of the song lyrics, which is unnecessary for the analysis.
- Removing the "Embed" word from the end of the lyrics, since the Genius API returned it at the end of every lyric, which is also unnecessary for the analysis.
- Normalizing contractions for the English language, since many commonly used contractions can change the words of a sentence when expanded. For other languages it is difficult to identify the common contractions, as support for those languages is not widely available and several contractions can have different meanings in different contexts. Additionally, most contractions in languages such as French or German are prepositions or pronouns, which do not contribute much to sentiment analysis (as opposed to English contractions, which shorten verbs) as they are generally considered neutral. Hence, contractions in other languages are not considered in this analysis; handling them could be an extension of the work.
- Removing special characters and numbers
- Removing short or empty lyrics
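A condensed sketch of these preprocessing steps for a single lyric string; the contractions helper is an assumption standing in for whichever normalization the project used:

```python
import re

import contractions  # assumption: a contraction-expansion helper for English

def preprocess_lyrics(text: str, language: str = "en") -> str:
    lines = text.lower().splitlines()
    # Genius returns the song title as the first line; drop it.
    text = " ".join(lines[1:]) if len(lines) > 1 else ""
    # Genius appends an "Embed" marker (often preceded by digits) at the end.
    text = re.sub(r"\d*embed$", "", text).strip()
    if language == "en":
        text = contractions.fix(text)  # e.g. "don't" -> "do not"
    # Remove special characters and numbers, keeping letters (incl. accents).
    text = re.sub(r"[^a-zà-ÿ\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Short or empty results can then be filtered out, e.g. by keeping only lyrics with more than a handful of words.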
Text normalization then took place to normalize the data for the analysis. The following steps were taken (a minimal English-language sketch follows this list):
- Removing stop words, i.e., words that do not contribute much to the meaning of a sentence, such as pronouns or prepositions. This was done per language by passing the language of the lyrics as a parameter to the stop-words function.
- Stemming, which removes suffixes from words in order to reduce words with a common root to the same form. The stemming process is not grammatically correct, since it strips words based on a set of rules rather than using linguistic roots based on the part of speech.
- Thus, lemmatization was used instead, since it takes into consideration the part of speech of the word, i.e., whether the word is a noun, verb, adjective, adverb, etc. This approach is more suitable for the analysis, as the resulting words are actually part of the language's vocabulary and are easier to interpret than stemmed forms.
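An English-only sketch of the stop-word removal and POS-aware lemmatization using nltk; other languages would need their own stop-word lists and a multilingual lemmatizer such as pattern:

```python
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

stop_words = set(stopwords.words("english"))  # "spanish", "german", ... per language
lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag: str) -> str:
    """Map Penn Treebank tags to the WordNet POS codes the lemmatizer expects."""
    return {"J": "a", "V": "v", "R": "r"}.get(treebank_tag[0], "n")

def normalize(text: str) -> list[str]:
    tokens = [t for t in word_tokenize(text) if t not in stop_words]
    return [lemmatizer.lemmatize(tok, to_wordnet_pos(tag))
            for tok, tag in pos_tag(tokens)]
```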
Afterwards, the counts of the most frequent words in the lyrics were determined by splitting the lyrics into words and counting the unique ones.
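For instance, with a collections.Counter over the normalized lyrics (assuming they are stored as whitespace-separated strings):

```python
from collections import Counter

import pandas as pd

df = pd.read_csv("All_Songs_Cleaned.csv")  # or one of the per-language frames

word_counts = Counter()
for lyric in df["track_lyrics"].dropna():
    word_counts.update(str(lyric).split())

print(word_counts.most_common(20))  # the most frequent words across all lyrics
```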
Next, the polarity score of the lyrics was calculated with the help of textblob/vader in order to gain insight into whether the lyrics collected are more positive or negative based on the words they contain.
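Both scorers produce a value in [-1, 1], from negative to positive; a minimal sketch:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

lyric = "i feel wonderful tonight"  # placeholder lyric

textblob_polarity = TextBlob(lyric).sentiment.polarity   # -1 (negative) to 1 (positive)
vader_compound = sia.polarity_scores(lyric)["compound"]  # -1 (negative) to 1 (positive)
```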
The libraries used throughout the project include:
- selenium webdriver
- pandas
- concurrent.futures
- spotipy
- lyricsgenius
- time
- numpy
- requests
- regex
- matplotlib
- swifter
- seaborn
- plotly
- langid
- nltk
- wordcloud
- textblob
- pattern
The next steps are to continue performing a more detailed EDA to gain further insight into the song data and to understand trends and patterns in music features. Additionally, a dashboard could be used to display the figures and graphs in a more organized and clear manner, allowing users to select the audio features and settings of their choice. The project could also be extended with a recommendation engine for suggesting similar artists, and machine learning could be used to train models that analyze audio features and lyrics.