Tripadvisor Reviews Classification using NLP #382

Merged
merged 3 commits into from
Dec 13, 2023
1 change: 1 addition & 0 deletions TripAdvisor Reviews/Dataset/Readme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
https://www.kaggle.com/datasets/arnabchaki/tripadvisor-reviews-2023/data
Binary file added TripAdvisor Reviews/Dataset/archive (5).zip
Binary file not shown.
1 change: 1 addition & 0 deletions TripAdvisor Reviews/Images/Readme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Used word clouds and bar charts for EDA
Binary file added TripAdvisor Reviews/Images/Screenshot (257).png
Binary file added TripAdvisor Reviews/Images/Screenshot (259).png
64 changes: 64 additions & 0 deletions TripAdvisor Reviews/Models/ReadMe.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
**PROJECT TITLE**
TripAdvisor Reviews Classification Using Deep Learning

**GOAL**
To classify TripAdvisor reviews using Deep Learning

**DATASET**
https://www.kaggle.com/datasets/arnabchaki/tripadvisor-reviews-2023/data

**DESCRIPTION**
This project compares the accuracy of a GRU network on GloVe embeddings with that of Logistic Regression on TF-IDF features. It combines traditional machine learning and deep learning techniques to classify sentiment in TripAdvisor reviews, exploring two different approaches to text classification.

**WHAT I HAVE DONE**
1. TF-IDF Vectorizer and Logistic Regression:
a. Data Preparation:
Dataset: Collect a dataset of TripAdvisor reviews with labeled sentiments.
Preprocessing: Clean and preprocess the text data, including steps like removing stop words, stemming, and handling special characters.
b. Feature Extraction:
TF-IDF Vectorizer: Convert the preprocessed text data into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. This technique assigns weights to words based on their importance in the corpus.
c. Model Building:
Logistic Regression: Train a logistic regression classifier using the TF-IDF vectors as input features. Logistic Regression is a commonly used algorithm for binary classification tasks.
d. Model Evaluation:
Split Data: Split the dataset into training and testing sets.
Evaluate Performance: Assess the model's performance using metrics like accuracy, precision, recall, and F1-score on the test set.
2. GloVe Embeddings and GRU:
a. Data Preparation:
Embeddings: Use pre-trained GloVe embeddings to convert words into dense vectors. GloVe embeddings capture semantic relationships between words.
Padding: Ensure that all sequences of reviews have the same length by padding or truncating them.
b. Model Architecture:
GRU Neural Network: Build a neural network using GRU layers for sequential data processing. GRUs are a type of recurrent neural network (RNN) that can capture long-term dependencies in sequential data.
c. Model Training:
Transfer Learning: Either fine-tune the pre-trained GloVe embeddings during training or keep them as fixed (frozen) weights.
Training: Train the GRU neural network on the labeled TripAdvisor reviews.
d. Model Evaluation:
Validation: Evaluate the performance of the GRU model on a validation set to tune hyperparameters.
Test Set Evaluation: Assess the final model's performance on a separate test set using appropriate evaluation metrics.
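
Step 1 above (TF-IDF features, logistic regression, train/test split, and metrics) can be sketched with scikit-learn roughly as follows. This is a minimal sketch, not the notebook's actual code: the toy reviews, the labels, and parameter values such as `test_size` and `max_iter` are assumptions chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data; 1 = positive review, 0 = negative review.
reviews = [
    "Great hotel with friendly staff",
    "Lovely room and amazing breakfast",
    "Wonderful location, would visit again",
    "Clean, quiet and comfortable stay",
    "Terrible service and dirty room",
    "Awful food, rude staff",
    "Noisy, overpriced and disappointing",
    "Worst stay ever, never again",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a stratified test set so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=42
)

# Chain vectorizer and classifier so TF-IDF statistics come from training data only.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

preds = model.predict(X_test)
# Reports precision, recall, and F1-score per class, plus overall accuracy.
print(classification_report(y_test, preds, zero_division=0))
```

Wrapping both steps in one pipeline keeps the test reviews out of the TF-IDF vocabulary and document-frequency counts, which avoids leaking test information into the features.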

**MODELS USED**
Logistic Regression, GRU (a type of RNN)

**LIBRARIES NEEDED**
Pandas, NumPy, Keras, TensorFlow, scikit-learn, Seaborn, Matplotlib, NLTK

**VISUALIZATION**
EDA results (word clouds and bar charts) are in the Images folder

**ACCURACIES**
Around 94% for Logistic Regression
Around 85% for GRU with GloVe Embeddings

**CONCLUSION**
*TF-IDF Vectorization:*
Use the TF-IDF Vectorizer to convert the preprocessed text data into numerical features.
TF-IDF assigns weights to words based on their importance in a document relative to the entire corpus.
This results in a sparse matrix where each row represents a document, and each column represents a unique word with its corresponding TF-IDF weight.
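
To make the weighting concrete, here is a small from-scratch sketch of the textbook TF-IDF formula (term frequency times `log(N / df)`). It is an illustration only: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers differ, and it stores the result as a sparse matrix rather than the dense lists used here. The function name `tfidf_matrix` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows), where rows[i][j] is the TF-IDF of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        rows.append(
            [(counts[term] / length) * math.log(n_docs / df[term]) for term in vocab]
        )
    return vocab, rows


docs = ["great hotel great staff", "terrible hotel", "great view"]
vocab, rows = tfidf_matrix(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the first document, "staff" and "hotel" each occur once, but "staff" gets the larger weight because it appears in fewer documents overall.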

*GloVe Embeddings*
GloVe embeddings capture semantic relationships between words based on their co-occurrence statistics in a large corpus. This helps the model to understand the meaning and context of words in a more nuanced way compared to traditional one-hot encoding.
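
As a rough illustration of how pre-trained vectors are typically wired into a model, the sketch below parses lines in the GloVe text format (a token followed by its space-separated vector), builds an embedding matrix indexed by word id, and pads a review to a fixed length. The three-dimensional vectors, the `word_index` mapping, and all function names are made up for the example; real GloVe files have 50 to 300 dimensions, and the notebook may rely on Keras utilities instead.

```python
def parse_glove(lines):
    """Parse GloVe text format: each line is a token followed by its vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for word id i; row 0 is reserved for padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:  # out-of-vocabulary words keep an all-zero row
            matrix[idx] = vectors[word]
    return matrix


def pad_sequence(ids, max_len, pad_id=0):
    """Truncate or right-pad a list of word ids to exactly max_len entries."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


# Made-up 3-dimensional vectors standing in for a real GloVe file.
sample_lines = [
    "hotel 0.1 0.2 0.3",
    "great 0.5 0.1 -0.2",
    "staff -0.3 0.4 0.0",
]
word_index = {"great": 1, "hotel": 2, "spa": 3}  # "spa" is not in the sample vectors

matrix = build_embedding_matrix(word_index, parse_glove(sample_lines), dim=3)
padded = pad_sequence([word_index[w] for w in "great hotel".split()], max_len=4)
print(matrix)
print(padded)
```

A matrix like this can then initialize an embedding layer, which is either frozen or fine-tuned during training.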

*GRU*
GRU, being a type of recurrent neural network (RNN), is effective in capturing sequential dependencies in data. It can understand the relationships between words in a sequence, which is crucial for understanding the meaning of sentences and paragraphs.
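
The way a GRU carries information along a sequence can be sketched with a single-cell forward pass in NumPy. This is an illustrative sketch with randomly initialized weights, not the trained Keras model from the notebook; the parameter names and dimensions are assumptions. It follows the convention `h_t = (1 - z) * h_prev + z * h_candidate`; some implementations swap the roles of `z` and `1 - z`.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU time step: gates decide how much of the old state to keep."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])  # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])  # reset gate
    # Candidate state sees the previous state only through the reset gate.
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_cand


def init_params(input_dim, hidden_dim, rng):
    """Small random weights for the update (z), reset (r), and candidate (h) paths."""
    p = {}
    for gate in "zrh":
        p["W" + gate] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p["U" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p["b" + gate] = np.zeros(hidden_dim)
    return p


rng = np.random.default_rng(0)
params = init_params(input_dim=3, hidden_dim=4, rng=rng)

# Five time steps of 3-dimensional word vectors (random stand-ins for embeddings).
sequence = rng.normal(size=(5, 3))

h = np.zeros(4)
for x_t in sequence:
    h = gru_step(x_t, h, params)  # the hidden state summarizes the words seen so far

print(h)
```

The final hidden state `h` is what a classification layer would consume to predict the review's sentiment.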

**YOUR NAME**
Aindree Chatterjee

1 change: 1 addition & 0 deletions TripAdvisor Reviews/Models/gru-tripadvisor-reviews.ipynb

Large diffs are not rendered by default.


3 changes: 3 additions & 0 deletions TripAdvisor Reviews/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
Pandas, Numpy, Keras,TensorFlow, ScikitLearn, Seaborn, Matplotlib,NLTK

(import GloVe embeddings in Kaggle for this to work)