A multi-class sentiment analysis problem: classify texts into five emotion categories (joy, sadness, anger, fear, neutral). This is a fun weekend project that walks through different text classification techniques, including dataset preparation, traditional machine learning with scikit-learn, LSTM neural networks, and transfer learning using BERT (TensorFlow Keras).
Summary Table
Dataset | Year | Content | Size | Emotion categories | Balanced |
---|---|---|---|---|---|
dailydialog | 2017 | dialogues | 102k sentences | neutral, joy, surprise, sadness, anger, disgust, fear | No |
emotion-stimulus | 2015 | dialogues | 2.5k sentences | sadness, joy, anger, fear, surprise, disgust | No |
isear | 1990 | emotional situations | 7.5k sentences | joy, fear, anger, sadness, disgust, shame, guilt | Yes |
links: dailydialog, emotion-stimulus, isear
The dataset was combined from dailydialog, isear, and emotion-stimulus to create a balanced dataset with 5 labels: joy, sadness, anger, fear, and neutral. The texts mainly consist of short messages and dialog utterances.
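
This README does not describe how the three sources were merged, so the following is only a minimal sketch of one plausible approach: keep the five target labels and downsample every class to the size of the smallest one. The `balance` helper, the `TARGET_LABELS` set, and the toy samples are all hypothetical and are not taken from the project code.

```python
import random
from collections import Counter, defaultdict

# Assumption: the five target labels named in this README
TARGET_LABELS = {"joy", "sadness", "anger", "fear", "neutral"}


def balance(samples, seed=0):
    """Keep only target labels and downsample each class to the smallest one."""
    by_label = defaultdict(list)
    for text, label in samples:
        if label in TARGET_LABELS:  # drops e.g. surprise, disgust, shame, guilt
            by_label[label].append(text)
    smallest = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    balanced = [
        (text, label)
        for label, texts in by_label.items()
        for text in rng.sample(texts, smallest)
    ]
    rng.shuffle(balanced)
    return balanced


# Toy samples, invented for illustration only
samples = [
    ("I passed my exam!", "joy"),
    ("What a lovely day", "joy"),
    ("We won the game", "joy"),
    ("I miss my old friends", "sadness"),
    ("My dog passed away", "sadness"),
    ("Stop lying to me", "anger"),
    ("This is outrageous", "anger"),
    ("I heard footsteps behind me", "fear"),
    ("The plane started shaking", "fear"),
    ("See you at five", "neutral"),
    ("The meeting is on Monday", "neutral"),
    ("Please pass the salt", "neutral"),
    ("It is in the second drawer", "neutral"),
    ("Wow, I did not expect that", "surprise"),
]

balanced = balance(samples)
print(Counter(label for _, label in balanced))
```

Downsampling discards data from the larger classes; oversampling or class weights would be alternative ways to deal with imbalance.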
- Data preprocessing: noise and punctuation removal, tokenization, stemming
- Text Representation: TF-IDF
- Classifiers: Naive Bayes, Random Forest, Logistic Regression, SVM
Approach | F1-Score |
---|---|
Naive Bayes | 0.6702 |
Random Forest | 0.6372 |
Logistic Regression | 0.6935 |
SVM | 0.7271 |
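
The traditional pipeline listed above (cleaning, TF-IDF, linear classifier) can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the project's actual code: the `clean_text` helper, the regular expressions, the vectorizer settings, and the toy data are assumptions, and stemming is left out to keep the example free of extra dependencies.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def clean_text(text):
    """Lowercase, drop URLs/@mentions (noise) and punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # noise: URLs and mentions
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation, digits, symbols
    return re.sub(r"\s+", " ", text).strip()


# Toy data, invented for illustration only
train_texts = [
    "I am so happy today!",
    "This is wonderful news, I love it",
    "I feel so lonely and sad",
    "I cried all night",
    "I am furious about this",
    "How dare you, this makes me so angry",
    "I am scared of the dark",
    "That noise terrified me",
    "The train leaves at noon",
    "I will send the report tomorrow",
]
train_labels = [
    "joy", "joy", "sadness", "sadness", "anger",
    "anger", "fear", "fear", "neutral", "neutral",
]

# TF-IDF features (tokenized by the vectorizer's default pattern) + linear SVM
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_text, ngram_range=(1, 2), sublinear_tf=True),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

predictions = model.predict(["I am so angry right now", "See you at noon"])
print(predictions)
```

Swapping `LinearSVC()` for `MultinomialNB()`, `RandomForestClassifier()`, or `LogisticRegression()` gives the other classifiers from the table while keeping the same features.
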
- Data preprocessing: noise and punctuation removal, tokenization
- Word Embeddings: pretrained 300-dimensional word2vec
- Deep Network: LSTM, biLSTM, CNN
Approach | F1-Score |
---|---|
LSTM + w2v_wiki | 0.7395 |
biLSTM + w2v_wiki | 0.7414 |
CNN + w2v_wiki | 0.7580 |
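
As a rough sketch of the biLSTM variant, the Keras model below feeds a frozen 300-dimensional embedding layer into a bidirectional LSTM and a 5-way softmax. The vocabulary size, sequence length, layer sizes, and dropout rate are assumptions, and the embedding matrix is filled with random numbers as a stand-in for the pretrained word2vec vectors.

```python
import numpy as np
from tensorflow.keras import initializers, layers, models

NUM_CLASSES = 5  # joy, sadness, anger, fear, neutral
EMBED_DIM = 300  # matches the 300-dimensional word2vec vectors
VOCAB_SIZE = 1000  # hypothetical vocabulary size
MAX_LEN = 50  # hypothetical maximum sequence length (in tokens)

# Stand-in for the pretrained word2vec matrix: row i would hold the vector
# of the word with index i. Random values are used here for illustration.
embedding_matrix = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")


def build_bilstm():
    """Frozen embeddings -> bidirectional LSTM -> softmax over the classes."""
    inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(
        VOCAB_SIZE,
        EMBED_DIM,
        embeddings_initializer=initializers.Constant(embedding_matrix),
        trainable=False,  # keep the pretrained vectors fixed
    )(inputs)
    x = layers.Bidirectional(layers.LSTM(128))(x)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # expects integer class ids
        metrics=["accuracy"],
    )
    return model


model = build_bilstm()

# Two dummy padded sequences of token ids, just to check the output shape
probs = model.predict(np.zeros((2, MAX_LEN), dtype="int32"), verbose=0)
print(probs.shape)
```

Replacing `layers.Bidirectional(layers.LSTM(128))` with a plain `layers.LSTM(128)` gives the unidirectional variant from the table.
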
Finetuning BERT for text classification
Approach | F1-Score |
---|---|
finetuned BERT | 0.8320 |