
Twitter_Sentiment_NLP

Overview

       Sentiment analysis is an NLP (Natural Language Processing) task: determining the sentiment expressed in a piece of text, classically whether it is positive or negative. In this project we classify tweets into four classes: positive, negative, neutral, or irrelevant.

App

       I built this application using several tools, libraries, and frameworks. The model itself was developed in a notebook on Google Colab:

  1. TensorFlow
  2. pandas
  3. seaborn
  4. matplotlib
  5. sklearn
  6. numpy
  7. zipfile
  8. html
  9. FastAPI

Run App

  1. Go to the web directory in a terminal
  2. Run the application with the command uvicorn app:app (a sketch of such an app.py follows below)
  3. Open http://127.0.0.1:8000/ in your browser
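
       For orientation, here is a minimal sketch of what an app.py served by that command could look like. The endpoint paths, response shape, and prediction placeholder are assumptions for illustration, not code from this repo.

```python
# app.py - illustrative FastAPI sketch (not the repo's actual code)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Tweet(BaseModel):
    text: str

@app.get("/")
def index():
    # The real app serves the HTML front end here
    return {"message": "Twitter sentiment classifier"}

@app.post("/predict")
def predict(tweet: Tweet):
    # The real app would tokenize/pad tweet.text and call model.predict()
    return {"sentiment": "positive"}  # placeholder response
```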

App Overview

(screenshot of the running app)

Dataset

       The dataset can be downloaded from Kaggle: Twitter Sentiment Analysis.


       There are two CSV files in the zip archive:

  • twitter_training.csv: used to train the model

  • twitter_validation.csv: used as validation data

In this notebook I only use twitter_training.csv.

Notebook

Exploratory Data Analysis

1. Show the first five records of the dataset


       The output shows that the CSV has no header row, so we assign a name to each column. Naming the columns makes the dataset much easier to explore.
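
       A sketch of this step, assuming the on-disk column order matches the names used later in this README:

```python
import pandas as pd

# The CSV has no header row, so supply the column names at load time.
# The column order here is an assumption based on the names used below.
df = pd.read_csv(
    "twitter_training.csv",
    names=["tweet_id", "entity", "sentiment", "tweet_content"],
)
print(df.head())
```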


2. Check the shape of the dataset: there are 74681 rows and 4 columns

3. Check missing values in the dataset


       There are 686 missing values, all in the tweet_content column. I handled them by dropping the affected rows, which left 73995 rows and 4 columns.
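
       Roughly, continuing from the df loaded above:

```python
# Count missing values per column, then drop the incomplete rows.
print(df.isnull().sum())                  # tweet_content: 686
df = df.dropna(subset=["tweet_content"])
print(df.shape)                           # (73995, 4)
```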

4. Drop unnecessary columns

       I dropped the tweet_id and entity columns because they are not needed for sentiment classification.
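
       For example:

```python
# Identifier columns carry no signal for the sentiment model.
df = df.drop(columns=["tweet_id", "entity"])
```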

5. Check the labels

       I checked the per-class counts to guard against imbalanced data; the four sentiment classes turned out to be reasonably balanced.
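
       One line suffices for this check:

```python
# Per-class counts; roughly equal counts mean no rebalancing is needed.
print(df["sentiment"].value_counts())
```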


Data Preprocessing

1. One Hot Encoding

       The first preprocessing step is one-hot encoding the labels, since the sentiment column is categorical rather than numerical. I used pd.get_dummies() to one-hot encode the sentiment column.
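
       For example (the variable name labels is mine):

```python
# Expand the four sentiment categories into indicator columns, one per class.
labels = pd.get_dummies(df["sentiment"])
print(labels.head())
```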


2. Change column into numpy array

       To process the dataset we convert each column into a NumPy array, which makes it easier to tokenize and split.
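
       A sketch, with variable names assumed for illustration:

```python
# Plain NumPy arrays are what the tokenizer and splitter operate on.
tweets = df["tweet_content"].to_numpy()
label_array = labels.to_numpy()
```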


3. Split data

       I then split the dataset into train_tweet, test_tweet, train_label, and test_label, with a train size of 80%, a test size of 20%, and random_state = 42.
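
       Using scikit-learn, for example:

```python
from sklearn.model_selection import train_test_split

# 80/20 split with a fixed seed for reproducibility.
train_tweet, test_tweet, train_label, test_label = train_test_split(
    tweets, label_array, train_size=0.8, test_size=0.2, random_state=42
)
```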


4. Tokenizer

       Next I fit a Tokenizer on train_tweet with num_words = 10000 and an OOV token to stand in for out-of-vocabulary words. After fitting, I used tokenizer.texts_to_sequences() to convert the texts in train_tweet and test_tweet into integer sequences.
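
       A sketch of this step; the literal OOV string is an assumption:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Keep the 10000 most frequent words; everything else maps to the
# OOV token ("<OOV>" is a common choice, assumed here).
tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(train_tweet)

train_sequences = tokenizer.texts_to_sequences(train_tweet)
test_sequences = tokenizer.texts_to_sequences(test_tweet)
```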

5. Add padding

       To handle the varying sequence lengths I applied pad_sequences to train_sequences and test_sequences with maxlen = 150, padding = 'post' (so the padding is appended to the end of each sequence), and truncating = 'post' (so sequences longer than 150 are cropped from the end).
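
       For example:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad and truncate at the end ('post') so every sequence has length 150.
train_padded = pad_sequences(train_sequences, maxlen=150,
                             padding="post", truncating="post")
test_padded = pad_sequences(test_sequences, maxlen=150,
                            padding="post", truncating="post")
```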

Build Model

       I built the model with TensorFlow, using an Embedding layer followed by Bidirectional LSTM layers. For the Embedding layer I used input_dim = 10000, output_dim = 16, and input_length = 150.
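
       A sketch of such an architecture; only the Embedding parameters come from this README, while the LSTM and Dense sizes are assumptions:

```python
import tensorflow as tf

# Embedding parameters (input_dim, output_dim, input_length) are from
# the README; the LSTM and Dense sizes are illustrative assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16, input_length=150),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),  # 4 sentiment classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # matches one-hot labels
              metrics=["accuracy"])
```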


Evaluate Model

       I trained with a callback that stopped training after 9 epochs, ending with 92% accuracy, 84% val_accuracy, a loss of 0.19, and a val_loss of 0.59.
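
       The notebook's callback is not shown here, so the sketch below uses Keras's built-in EarlyStopping as one plausible implementation; it continues from the sketches above:

```python
# Stop training once val_loss stops improving; keep the best weights.
callback = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

history = model.fit(
    train_padded, train_label,
    validation_data=(test_padded, test_label),
    epochs=50,
    callbacks=[callback],
)
```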

