This project explores the application of Machine Learning (ML) and Deep Learning (DL) techniques to sentiment analysis of movie reviews. The goal is to automatically classify written movie reviews as positive or negative, which can help in understanding audience preferences and improving film marketing strategies.
- Jerry Yang
- Ngu JiaHao
- Roydon Tay
- Felise Leow
We evaluated various ML and DL models on a dataset of movie reviews, using Python libraries such as Scikit-Learn and PyTorch. The techniques include:
- Text preprocessing
- Feature extraction for ML methods (TF-IDF and CountVectorizer; see the sketch after this list)
- Random Forest, Multinomial Naive Bayes, and Logistic Regression
- BERT and DistilBERT for advanced text representations
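For illustration, here is a minimal sketch of the two feature-extraction approaches named above; the example reviews are toy data, not drawn from the project dataset.

```python
# Minimal sketch of the two bag-of-words feature extractors used for the ML
# models; the example reviews are illustrative, not from the project dataset.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = ["a gripping and moving film", "a dull, forgettable film"]

count_vec = CountVectorizer()   # raw term counts
tfidf_vec = TfidfVectorizer()   # term counts reweighted by inverse document frequency

X_counts = count_vec.fit_transform(reviews)
X_tfidf = tfidf_vec.fit_transform(reviews)

print(count_vec.get_feature_names_out())  # learned vocabulary
print(X_tfidf.shape)                      # (n_reviews, vocabulary_size)
```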
The dataset comprises user-generated movie reviews collected from various online platforms. Reviews were preprocessed to remove noise using techniques such as tokenization, lemmatization, and stopword removal. Experiments were conducted for 8-class, 3-class, and 2-class classification with the ML methods, and for 8-class and 2-class classification with the DL methods; the 2-class setting yielded the highest accuracies in both cases.
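A minimal sketch of those preprocessing steps, assuming NLTK as the toolkit; the notebooks in the code/ directory contain the project's actual pipeline, which may differ in details.

```python
# Hedged preprocessing sketch (tokenization, stopword removal, lemmatization)
# using NLTK; the project notebooks may use a different toolkit or settings.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    # lowercase, tokenize, keep alphabetic non-stopword tokens, lemmatize
    tokens = word_tokenize(text.lower())
    kept = [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words]
    return " ".join(kept)

print(preprocess("The actors were brilliant, but the plot dragged on!"))
# -> "actor brilliant plot dragged"
```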
Details and discussions about methods used, results and related studies can be found in our project report.
We performed n-gram analysis to identify common phrases in positive and negative reviews. Visualisation plots such as boxplots, countplots and word clouds were used to understand class distribution, conduct feature analysis, and handle outliers in the data.
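As an illustration of the n-gram step, the sketch below counts the most frequent bigrams per sentiment class; the toy DataFrame and its column names are assumptions for illustration, not the project's actual schema.

```python
# Sketch of n-gram (bigram) frequency analysis per sentiment class; the toy
# DataFrame and its "review"/"sentiment" columns are assumed for illustration.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    "review": ["great acting and great pacing", "terrible plot and terrible pacing"],
    "sentiment": ["positive", "negative"],
})

def top_bigrams(texts, n=10):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words="english")
    counts = vec.fit_transform(texts).sum(axis=0).A1  # total count of each bigram
    return sorted(zip(vec.get_feature_names_out(), counts), key=lambda x: -x[1])[:n]

print(top_bigrams(df.loc[df["sentiment"] == "positive", "review"]))
print(top_bigrams(df.loc[df["sentiment"] == "negative", "review"]))
```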
We trained and experimented with the following models:
- Random Forest Classifier: An ensemble model to reduce overfitting and improve accuracy.
- Multinomial Naive Bayes: A probabilistic model ideal for text data classification.
- Logistic Regression: A baseline model for binary classification tasks. Model performance was improved through hyperparameter tuning with GridSearchCV and RandomizedSearchCV, as sketched below.
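The sketch below illustrates the shape of that tuning step on a TF-IDF + Logistic Regression pipeline with GridSearchCV; the parameter grid and toy data are placeholders rather than the project's actual search space.

```python
# Hedged sketch of hyperparameter tuning with GridSearchCV on a
# TF-IDF + Logistic Regression pipeline; grid values and toy data are
# illustrative, not the project's actual configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

X_train = [
    "an absolute delight from start to finish",
    "wonderful performances and a smart script",
    "a tedious, lifeless slog",
    "poorly written and badly acted",
]
y_train = ["positive", "positive", "negative", "negative"]

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```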
Our DL approach involved fine-tuning DistilBERT, a lighter version of BERT that retains most of the original model's predictive power while being less resource-intensive. This allowed us to handle larger datasets more efficiently with limited computational resources. Bayesian optimisation was used to tune the learning rate and dropout value.
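Below is a minimal fine-tuning sketch using the Hugging Face transformers Trainer API; the toy data, label encoding, and hyperparameter values are placeholders, and the Bayesian search itself (e.g. with a library such as Optuna) is not shown.

```python
# Minimal DistilBERT fine-tuning sketch with the Hugging Face Trainer API.
# Toy data, label encoding, and hyperparameter values are illustrative only;
# in the project, learning rate and dropout were chosen via Bayesian optimisation.
import torch
from transformers import (
    DistilBertConfig,
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

texts = ["a gripping and moving film", "a dull, forgettable film"]
labels = [1, 0]  # 1 = positive, 0 = negative (2-class setting)
encodings = tokenizer(texts, truncation=True, padding=True)

class ReviewDataset(torch.utils.data.Dataset):
    """Wraps tokenizer output and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

# Dropout is one of the two tuned hyperparameters; 0.2 is a placeholder value.
config = DistilBertConfig.from_pretrained(
    "distilbert-base-uncased", num_labels=2, dropout=0.2
)
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", config=config
)

# Learning rate is the other tuned hyperparameter; 2e-5 is a placeholder value.
args = TrainingArguments(
    output_dir="distilbert-out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=ReviewDataset(encodings, labels))
trainer.train()
```

A Bayesian search would wrap the model construction and `trainer.train()` call above in an objective that evaluates validation accuracy for each candidate learning rate and dropout value.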
Our models achieved the following accuracies:
| Model | Accuracy |
| --- | --- |
| Best ML model (Logistic Regression with TF-IDF) | 91.0% |
| Best DL model (fine-tuned DistilBERT) | 93.0% |
Detailed performance metrics including precision, recall, and F1-score are available in the results section of the project report in this repository.
To replicate our findings or use the models:
- Clone this repository.
- Install the required packages from `requirements.txt`.
- Run the Jupyter notebooks in the `code/` directory to train the models on your data.

Alternatively, you may run the notebooks on Google Colaboratory from the links in the respective notebooks.