Sentiment analysis of tweets to detect racist/sexist tweets using Bag of Words, TF-IDF features, Word2Vec and Doc2Vec.
- Python
- Tweepy
- TextBlob
- Pandas
- nltk (text manipulation)
- re (regular expression)
- Install TextBlob: `pip install -U textblob`
- Additional dependency: `python -m textblob.download_corpora`
```bash
git clone git@github.com:aakashjhawar/twitter-sentiment-analysis.git
cd twitter-sentiment-analysis
```
Then open the Jupyter Notebook file.
- Remove @user handles from the tweets
- Remove punctuation and numbers from the tweets: `replace("[^a-zA-Z#]", " ")`
- Remove words whose length is less than 3 (`len(word) < 3`)
- Text normalization using the Porter stemmer:
  - Tokenize the tweets
  - Normalize (stem) the tokens
  - Stitch them back together using nltk's Moses detokenizer (a sketch of the full pipeline follows below)
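A minimal sketch of these preprocessing steps on a single tweet (the example tweet is made up, and a plain `" ".join` stands in here for the Moses detokenizer):

```python
import re
from nltk.stem.porter import PorterStemmer

def clean_tweet(tweet):
    # Remove @user handles
    tweet = re.sub(r"@\w+", " ", tweet)
    # Keep only letters and hashtags (removes punctuation and numbers)
    tweet = re.sub(r"[^a-zA-Z#]", " ", tweet)
    # Tokenize and drop words whose length is less than 3
    tokens = [w for w in tweet.split() if len(w) >= 3]
    # Normalize each token with the Porter stemmer
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(w) for w in tokens]
    # Stitch the tokens back into a single string
    return " ".join(tokens)

print(clean_tweet("@user I am loving the new phone!!! #happy"))
# -> "love the new phone #happi"
```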
Bag of Words (BOW) is a method to extract features from text documents; these features can then be used to train machine learning algorithms. It creates a vocabulary of all the unique words occurring across the documents in the training set. In simple terms, it represents a sentence as a collection of word counts, mostly disregarding the order in which the words appear.
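As a quick illustration (using scikit-learn's `CountVectorizer` on a made-up toy corpus; the notebook's exact vectorizer settings may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "this movie is good",
    "this movie is bad bad",
]

# Build the vocabulary and count each word's occurrences per document
bow_vectorizer = CountVectorizer()
bow = bow_vectorizer.fit_transform(corpus)

print(bow_vectorizer.get_feature_names_out())  # ['bad' 'good' 'is' 'movie' 'this']
print(bow.toarray())                           # [[0 1 1 1 1], [2 0 1 1 1]]
```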
Term Frequency-Inverse Document Frequency (TF-IDF) penalises the most common words by assigning them lower weights, while giving importance to words that are rare in the corpus as a whole but appear in good numbers in a few documents.

TF = (number of times term 't' appears in a document) / (number of terms in the document)

IDF = log(N/n)

where N = total number of documents and n = number of documents in which term 't' has appeared.

TF-IDF = TF × IDF
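Plugging toy numbers into these formulas (the counts are made up, and the log base is a convention; natural log is used here):

```python
import math

# Term 't' appears 3 times in a 100-term document,
# and in 2 out of N = 10 documents in the corpus
tf = 3 / 100            # 0.03
idf = math.log(10 / 2)  # log(N/n) ~ 1.609
print(tf * idf)         # TF-IDF ~ 0.048
```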
Word2Vec represents each word as a vector (word embeddings). The objective is to reduce high-dimensional word features to low-dimensional features while preserving contextual similarity in the corpus.

Example:

King - Man + Woman = Queen
Advantages:
- Dimensionality reduction: significant reduction in the number of features
- It captures the meaning of a word, i.e., semantic relations and different types of context
Word2Vec is a combination of two algorithms:
- CBOW (Continuous Bag of Words): predicts a word from its surrounding context
- Skip-Gram: predicts the surrounding context from a word, and can capture different semantics for a single word, e.g. Apple is a company as well as a fruit

Both are shallow neural networks that map words to target variables which are also words.
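A minimal training sketch with gensim (toy tokenised tweets; parameter names follow gensim ≥ 4):

```python
from gensim.models import Word2Vec

tokenized_tweets = [
    ["love", "this", "phone"],
    ["hate", "this", "phone"],
    ["love", "this", "movie"],
]

# sg=0 trains CBOW, sg=1 trains Skip-Gram; vector_size is the embedding dimension
model = Word2Vec(sentences=tokenized_tweets, vector_size=100,
                 window=5, min_count=1, sg=1)

print(model.wv["love"].shape)         # (100,) -- one dense vector per word
print(model.wv.most_similar("love"))  # nearest words in the embedding space
# On a large corpus, analogies like the one above also work:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```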
Doc2Vec is an unsupervised algorithm that generates vectors for sentences, paragraphs and documents. It provides an additional context vector that is unique for every document in the corpus; this document vector is trained along with the word vectors.

To implement the Doc2Vec algorithm, label/tag each tokenised tweet with a unique id (gensim's LabeledSentence).
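A minimal sketch with gensim (toy data again; in gensim ≥ 4, `LabeledSentence` has been replaced by `TaggedDocument`):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tokenized_tweets = [
    ["love", "this", "phone"],
    ["hate", "this", "phone"],
]

# Tag each tokenised tweet with a unique id
tagged = [TaggedDocument(words=tokens, tags=[f"tweet_{i}"])
          for i, tokens in enumerate(tokenized_tweets)]

model = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=20)
print(model.dv["tweet_0"].shape)  # (100,) -- one vector per document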
Evaluation metric: the F1 score is used. It is the harmonic mean of Precision and Recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
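For example, with scikit-learn (made-up labels, where 1 = racist/sexist):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

p = precision_score(y_true, y_pred)  # 2/3
r = recall_score(y_true, y_pred)     # 2/3
print(2 * p * r / (p + r))           # harmonic mean, ~0.667
print(f1_score(y_true, y_pred))      # same value
```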
General approach for parameter tuning (a sketch follows this list):
- Choose a relatively high learning rate (usually 0.3)
- Tune the tree-specific parameters, e.g. max_depth, min_child_weight, subsample, etc.
- Tune the learning rate
- Finally, tune gamma to avoid overfitting
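A sketch of those steps with xgboost's scikit-learn API (`X_train`/`y_train` are placeholders for the extracted features and labels, and the grid values are illustrative):

```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Step 1: start with a relatively high learning rate
xgb = XGBClassifier(learning_rate=0.3, n_estimators=200)

# Step 2: tune tree-specific parameters
param_grid = {
    "max_depth": [4, 6, 8],
    "min_child_weight": [1, 3, 5],
    "subsample": [0.8, 1.0],
}
search = GridSearchCV(xgb, param_grid, scoring="f1", cv=3)
# search.fit(X_train, y_train)

# Steps 3-4: lower the learning rate, then tune gamma against overfitting, e.g.
# XGBClassifier(**search.best_params_, learning_rate=0.1, gamma=0.1)
```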
Per the evaluation metric above, F1 scores of Bag of Words, TF-IDF, Word2Vec and Doc2Vec features with Logistic Regression, Support Vector Machine (SVM), Random Forest and XGBoost classifiers:
| Classifier | Bag of Words | TF-IDF Features | Word2Vec | Doc2Vec |
|---|---|---|---|---|
| Logistic Regression | 53.05% | 54.46% | 61.72% | 36.90% |
| Support Vector Machine | 50.60% | 51.00% | 61.30% | 19.32% |
| Random Forest | 53.29% | 56.21% | 50.20% | 65.06% |
| XGBoost | 51.30% | 51.85% | 64.54% | 34.83% |