First, we will download the Twitter sentiment dataset from Kaggle; the link is given below:
Dataset Link: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset?select=train.csv
After downloading it, the sentiment labels are encoded as follows:
Negative Sentiment: 0
Positive Sentiment: 1
After this, we will upload the data to Google Drive, which we will mount in Google Colab.
Create a new notebook in Colab.
Go to the Runtime tab, change the runtime type to GPU, and save it.
Mount the Drive to which you uploaded the dataset you want to train the model on, as sketched below.
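A minimal sketch of mounting Drive and reading the dataset; the file path is an assumption, so point it at wherever you uploaded train.csv.

from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

# Path is an assumption; adjust it to your Drive folder. This Kaggle file
# may need a latin-1 encoding to read cleanly.
df = pd.read_csv('/content/drive/MyDrive/train.csv', encoding='latin-1')
print(df.head())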
Output
Now we will vectorize the words and load the word vectors into a dictionary so they can be looked up directly.
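The article does not say which word vectors are used, so the following is only an illustrative sketch: it assumes a GloVe-style text file (one word per line followed by its vector) and loads it into a Python dictionary for direct lookup. The file name is hypothetical.

import numpy as np

word_vectors = {}
# 'glove.6B.50d.txt' is a hypothetical file name; use whichever pre-trained
# vector file you downloaded.
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        word_vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

print(len(word_vectors), 'word vectors loaded')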
Download the wordnet and omw files from the NLTK library to handle the string data. After this we will tokenize the data, which splits each sentence into smaller parts (tokens), so the sentences become easier to process or translate into another language or format. After this we will lemmatize our data. The lemmatizer converts each word to its most common base form, or the most used similar word, so it is easier to detect.
Required Lib:
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
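A short, hedged sketch of tokenizing and lemmatizing one sample sentence with NLTK (the sentence itself is made up for illustration):

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')  # tokenizer models used by word_tokenize

lemmatizer = WordNetLemmatizer()
text = "I am loving these new running shoes"
tokens = word_tokenize(text.lower())                 # split the sentence into tokens
lemmas = [lemmatizer.lemmatize(t) for t in tokens]   # reduce each token to its base form

print("Before tokenizing:", text)
print("After tokenizing:", tokens)
print("After lemmatizing:", lemmas)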
Output
Before tokenizing:
After tokenizing:
Before lemmatizing:
After lemmatizing:
Now we will create a function that loops over the tokens, vectorizes the data, and returns it as float (decimal) values. Then we will divide the data into training and testing sets, with 70% for training and 30% for testing, which is considered a standard split.
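A hedged sketch of such a function and of the 70/30 split; it assumes the word_vectors dictionary from above and a DataFrame df with 'text' and 'label' columns (both column names are assumptions):

import numpy as np
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize

def vectorize_tokens(tokens, dim=50):
    # Look up each token in the embedding dictionary; unknown words fall
    # back to a zero vector, so the result is an array of floats.
    return np.array([word_vectors.get(t, np.zeros(dim, dtype='float32')) for t in tokens])

X = [vectorize_tokens(word_tokenize(str(t).lower())) for t in df['text']]
y = df['label'].values

# 70% for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)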
We will define df_to_X_y so we can count the sentence sizes.
Then we will plot to see the maximum and minimum number of tokens per sentence in our dataset.
After this we will pad our dataset to the maximum token size so every input has the same shape, the data shape stays consistent, and we don't get bad results during training.
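A sketch of padding with Keras; the maximum length of 57 matches the input shape used later in this article, but treat the exact value as an assumption:

from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 57  # assumed maximum token length, matching the model input shape below

X_train_pad = pad_sequences(X_train, maxlen=MAX_LEN, dtype='float32',
                            padding='post', truncating='post')
X_test_pad = pad_sequences(X_test, maxlen=MAX_LEN, dtype='float32',
                           padding='post', truncating='post')
print(X_train_pad.shape)  # (samples, 57, embedding dimension)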
Function for counting the token size of our dataset:
Now we print it and get an output graph of the token sizes, as sketched below.
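A small sketch of counting tokens per tweet and plotting the distribution (the 'text' column name is again an assumption):

import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize

# Number of tokens in each tweet.
token_counts = [len(word_tokenize(str(t))) for t in df['text']]
print('min tokens:', min(token_counts), 'max tokens:', max(token_counts))

plt.hist(token_counts, bins=30)
plt.xlabel('tokens per tweet')
plt.ylabel('number of tweets')
plt.show()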
We will create our model from LSTM, Dropout, Flatten, and Dense layers: the LSTM layers learn weights over the token sequence, Dropout reduces overfitting, Flatten collapses the sequence into a single vector, and the Dense layer produces the output value.
We use an input shape of 57, as the maximum token size is about 50. We use three LSTM layers with 64 units each, a Dropout of 0.2 so our model doesn't overfit, a Flatten layer near the end, and a Dense output layer with a sigmoid activation function because this is a binary classification problem.
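A minimal Keras sketch of this architecture, assuming 57 timesteps of 50-dimensional word vectors (the embedding size is an assumption):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Flatten, Dense

model = Sequential([
    # Three stacked LSTM layers with 64 units; return_sequences keeps the
    # full sequence so the next layer (and Flatten) can use it.
    LSTM(64, return_sequences=True, input_shape=(57, 50)),
    Dropout(0.2),
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    Flatten(),
    Dense(1, activation='sigmoid'),  # binary sentiment output
])
model.summary()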
First, we define the location where we want to save our trained weights.
The optimizer we use is Adam.
The loss is binary cross-entropy.
Now we will fit our model using the fit command, training for 20 epochs.
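A sketch of the checkpoint, compile, and fit steps; the checkpoint path is an assumption, and the validation data is simply the test split from above:

from tensorflow.keras.callbacks import ModelCheckpoint

# Save the best weights to Drive; the path is an assumption.
checkpoint = ModelCheckpoint('/content/drive/MyDrive/sentiment_model.h5',
                             save_best_only=True)

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train_pad, y_train,
                    validation_data=(X_test_pad, y_test),
                    epochs=20,
                    callbacks=[checkpoint])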
Now we will load our trained model and weights, then give it the test data to get the accuracy of our model, shown below.
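A short sketch of loading the saved model and evaluating it on the test set (same assumed path as above):

from tensorflow.keras.models import load_model

trained = load_model('/content/drive/MyDrive/sentiment_model.h5')
loss, accuracy = trained.evaluate(X_test_pad, y_test)
print('test accuracy:', accuracy)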
Output:
Accuracy on positive sentiment: 88%
Accuracy on negative sentiment: 81%
Output:
Actual: Positive Sentiment
Predicted: Positive Sentiment
Actual: Negative Sentiment
Predicted: Negative Sentiment
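To reproduce this kind of comparison, a hedged sketch of predicting the sentiment of one padded test sample and mapping the probability back to a label:

prob = trained.predict(X_test_pad[:1])[0][0]
predicted = 'Positive Sentiment' if prob >= 0.5 else 'Negative Sentiment'
actual = 'Positive Sentiment' if y_test[0] == 1 else 'Negative Sentiment'
print('Actual:', actual)
print('Predicted:', predicted)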