Dataset: https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
https://www.kaggle.com/rtatman/deceptive-opinion-spam-corpus
The data includes 1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 test samples in each polarity.
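
As a rough illustration of how the polarity data can be read, here is a minimal sketch using only the standard library. It assumes the archive unpacks to two-column CSV files (class index `1` or `2`, then the review text) with newlines inside a review stored as a literal `\n`; the function name `load_polarity_csv` and the 0/1 label mapping are choices made for this example, not part of the dataset or this repository.

```python
import csv


def load_polarity_csv(path):
    """Read (label, text) pairs from a two-column polarity CSV.

    Assumed layout: class index 1 (negative) or 2 (positive), then the text.
    Labels are shifted to 0/1 for convenience.
    """
    samples = []
    with open(path, newline="", encoding="utf-8") as handle:
        for class_index, text in csv.reader(handle):
            # Newlines inside a review are assumed to be stored as backslash + "n".
            samples.append((int(class_index) - 1, text.replace("\\n", "\n")))
    return samples
```

Calling `load_polarity_csv("train.csv")` (path assumed) returns a list of `(label, text)` tuples that can be passed to any of the models listed below.
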
Also, if you happen to refer to my work, a citation would do wonders for me. Thanks!
Salunkhe, Ashish. "Attention-based Bidirectional LSTM for Deceptive Opinion Spam Classification." arXiv preprint arXiv:2112.14789 (2021).
The following models are implemented:
- Bidirectional LSTM with GloVe 50D word embeddings
- LSTM with GloVe 100D word embeddings
- LSTM with GloVe 300D word embeddings
- CNN-LSTM with Doc2Vec and TF-IDF
- Attention mechanism with GloVe 100D word embeddings
- Logistic Regression
- Multinomial Naive Bayes
- Support Vector Machine - Stochastic Gradient Descent (SGD)
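
To make one of the classical baselines concrete, the sketch below implements Multinomial Naive Bayes with Laplace smoothing using only the standard library. It is an illustration of the technique, not the code used in this repository (which may well rely on a library implementation); the whitespace tokenizer, the function names, and the toy reviews are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(documents, labels, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed token log-likelihoods."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for text, label in zip(documents, labels):
        token_counts[label].update(text.lower().split())
    vocabulary = {tok for counts in token_counts.values() for tok in counts}

    model = {}
    for label, n_docs in class_counts.items():
        # Denominator: tokens seen in this class plus alpha per vocabulary entry.
        total = sum(token_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                tok: math.log((token_counts[label][tok] + alpha) / total)
                for tok in vocabulary
            },
        }
    return model


def predict(model, text):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in text.lower().split():
            score += params["log_likelihood"].get(tok, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data invented for illustration only.
reviews = [
    "great food friendly staff",
    "terrible service cold food",
    "friendly staff great place",
    "cold rude terrible",
]
labels = ["pos", "neg", "pos", "neg"]
model = train_multinomial_nb(reviews, labels)
print(predict(model, "great friendly place"))   # pos
print(predict(model, "cold terrible service"))  # neg
```

Working in log space avoids underflow when many small token probabilities are multiplied, and the smoothing term `alpha` keeps a token that never occurred in a class from forcing that class's probability to zero.
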
The results obtained were as follows:
Sr. No. | Model | Accuracy (%) | Precision Score | Recall Score | F1 Score |
---|---|---|---|---|---|
1 | MultinomialNB | 90.25 | 0.9325 | 0.8601 |
2 | Stochastic Gradient Descent (SGD) | 87.75 | 0.8913 | 0.8497 |
3 | Logistic Regression | 87.00 | 0.8691 | 0.8601 |
4 | Support Vector Machine | 56.25 | 0.525 | 0.9792 |
5 | Gaussian Naive Bayes | 63.5 | 0.6424 | 0.6169 |
6 | K-Nearest Neighbour | 57.5 | 0.8604 | 0.1840 |
7 | Decision tree | 68.5 | 0.6681 | 0.7412 |
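
For reference, the metrics named in the table header can be computed from confusion-matrix counts as sketched below. The function name and the example counts are hypothetical and chosen only to show the arithmetic; they are not taken from the experiments above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 30 true positives, 10 false positives,
# 30 false negatives, 30 true negatives.
print(classification_metrics(tp=30, fp=10, fn=30, tn=30))
# accuracy 0.6, precision 0.75, recall 0.5, f1 0.6
```

A high precision combined with a very low recall means the model rarely labels a sample as positive, which is why accuracy alone can be misleading.
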
Model | Training accuracy (%) | Testing accuracy (%) |
---|---|---|
Bidirectional LSTM + GloVe(50D) | 92.17 | 88.13 |
LSTM + GloVe(100D) | 99.18 | 85.75 |
CNN + LSTM + Doc2Vec + TF-IDF | 96.23 | 92.19 |
CNN + Attention + GloVe(100D) | 99.00 | 90.25 |
BiLSTM + Attention + GloVe(100D) | 99.18 | 89.27 |
CNN + BiLSTM + Attention + GloVe(100D) | 99.75 | 81.25 |
LogisticRegression + TF-IDF | 99.11 | 87.21 |
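
The difference between training and testing accuracy is a quick way to compare how well each model generalizes. The snippet below is a small sketch that computes this gap from the figures in the table; the dictionary layout and the helper name `accuracy_gap` are assumptions made for the example.

```python
# (training accuracy %, testing accuracy %) copied from the table above.
results = {
    "Bidirectional LSTM + GloVe(50D)": (92.17, 88.13),
    "LSTM + GloVe(100D)": (99.18, 85.75),
    "CNN + LSTM + Doc2Vec + TF-IDF": (96.23, 92.19),
    "CNN + Attention + GloVe(100D)": (99.00, 90.25),
    "BiLSTM + Attention + GloVe(100D)": (99.18, 89.27),
    "CNN + BiLSTM + Attention + GloVe(100D)": (99.75, 81.25),
    "LogisticRegression + TF-IDF": (99.11, 87.21),
}


def accuracy_gap(train_acc, test_acc):
    """Return the train/test accuracy gap in percentage points."""
    return round(train_acc - test_acc, 2)


gaps = {name: accuracy_gap(*scores) for name, scores in results.items()}
largest = max(gaps, key=gaps.get)
print(largest, gaps[largest])  # CNN + BiLSTM + Attention + GloVe(100D) 18.5
```
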
Future work includes improving the attention layer to raise testing accuracy. Transformer-based models such as BERT and XLNet could also be implemented to improve performance further.