RojakLanguageSentimentAnalysis

This is a machine learning project focused on analysing and classifying sentiments in code-switched and code-mixed text, specifically targeting the unique linguistic characteristics found in Malaysian conversations.

Introduction

The project analyzes and classifies sentiment in code-switched and code-mixed text, focusing on Malaysian linguistic patterns. It combines a deep learning model (LSTM) with traditional machine learning models (Multinomial Naive Bayes and SVM) to address the challenges of multilingual sentiment analysis in a linguistically diverse context.

Installation

To set up the project environment, follow these steps:

Clone the repository:

git clone https://github.com/Wei-RongRong2/RojakLanguageSentimentAnalysis

Navigate to the project directory:
```
cd RojakLanguageSentimentAnalysis
```
Install the required Python packages:
```
pip install -r requirements.txt
```

Usage

To run the sentiment analysis notebook, follow these steps:

Ensure you have Jupyter Notebook installed. If not, you can install it using:
```
pip install notebook
```
Navigate to the project directory where the Jupyter Notebook is located:
```
cd RojakLanguageSentimentAnalysis
```
Launch Jupyter Notebook:
```
jupyter notebook
```
In the Jupyter Notebook interface, open the RojakLanguageSentimentAnalysis.ipynb file.
Run the cells in the notebook from top to bottom to execute the sentiment analysis.

Web Application

The web_app folder contains the code for the web application:

app.py: The Flask application used to serve the web application.
templates/: Contains the HTML, CSS, and JavaScript files that define the front-end of the web application.
static/: Contains static assets such as images.

Running the Web Application

To run the web application locally:

Navigate to the web_app directory:
```
cd web_app
```
Install the required Python packages:
```
pip install -r requirements.txt
```
Run the Flask application:
```
python app.py
```

This will start the web application locally, and you can access it by navigating to http://127.0.0.1:5000 in your web browser.

The web application is also deployed on Render and can be accessed online at Malaysian Rojak Language Sentiment Analysis.

GUI Application

In addition to the web application, you can also run a graphical user interface (GUI) application:

Navigate to the web_app directory:
```
cd web_app
```
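Check that Tkinter is available. It ships with most Python installations and is not installed through pip (the `tk` package on PyPI is unrelated); if the check below fails on Debian or Ubuntu, install the `python3-tk` system package:
```
python -m tkinter
```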
Run the GUI application:
```
python GUI_model_deployment.py
```

This will launch a desktop application using Tkinter, providing an interface to make predictions.

Methodology

This project focused on sentiment analysis of Malaysian code-switched and code-mixed text, using data from Reddit and Hugging Face. Key steps included:

Data Collection

Source: A fusion of two datasets: a Reddit dataset from the Malaysia subreddit and a Twitter rojak dataset from mesolitica on Hugging Face.
Reddit Dataset: Derived from the Malaysia subreddit using the Reddit API, capturing diverse discussions within the Malaysian community. The Malaya library was used to identify and gather Rojak languages.
Twitter Rojak Dataset: Sourced from Hugging Face's language-detection-dataset, focusing on Twitter Rojak records.

Preprocessing

Data Cleaning: Removed duplicates, converted emojis to text, expanded contractions, and handled reduplicated words. Noise, including URLs and usernames, was removed while retaining punctuation for segmentation.
Segmentation & Tokenization: Used Malaya HuggingFace for sentence segmentation and NLTK for tokenization, enabling precise analysis of code-switched text.
Language Detection & Stemming: Detected language at the word level with Malaya’s FastText; applied stemming/lemmatization based on language.
Normalization: Replaced abbreviations and removed redundant or non-standard words using regex rules.
Named Entity Recognition (NER): Applied NER using Malaya and SpaCy, with challenges in code-mixed text.
Data Splitting: Divided the data into training, validation, and test sets (70-15-15 ratio).
Feature Extraction: Vectorized the text with TF-IDF, then applied PCA (retaining 95% of the variance) and Truncated SVD for dimensionality reduction.
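
The cleaning and splitting steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the notebook's actual code: the regular expressions, the helper names (`clean_text`, `drop_duplicates`, `split_70_15_15`), the fixed seed, and the sample sentences are all hypothetical.

```python
import random
import re

# Assumed patterns: http(s)/www links, Twitter-style @handles, Reddit-style u/handles.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"(?<!\w)(?:@|/?u/)[\w-]+")


def clean_text(text):
    """Strip URLs and usernames, keep punctuation, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def drop_duplicates(texts):
    """Remove exact duplicates while preserving first-seen order."""
    return list(dict.fromkeys(texts))


def split_70_15_15(items, seed=42):
    """Shuffle, then cut into train/validation/test at 70%/15%/15%."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 70 // 100
    n_val = len(items) * 15 // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Made-up sample posts, not taken from the project's datasets.
raw = [
    "@ali this nasi lemak sedap gila! https://example.com/pic",
    "@ali this nasi lemak sedap gila! https://example.com/pic",
    "u/abu traffic jam teruk today, so annoying.",
]
cleaned = drop_duplicates(clean_text(t) for t in raw)
print(cleaned)
```

Punctuation is left in place on purpose, since the pipeline described above relies on it for sentence segmentation later on.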

Model Training

Multinomial Naive Bayes (MultinomialNB): Utilized for text classification due to its efficiency with count-based features like word frequencies. Trained on TF-IDF features to capture text patterns.
Support Vector Machine (SVM): Chosen for its ability to handle high-dimensional text data and its versatility in finding the optimal hyperplane in the TF-IDF feature space. Also trained on TruncatedSVD-reduced features for enhanced speed and robustness.
Long Short-Term Memory (LSTM): A deep learning model used to capture sequential and long-range dependencies in text data. The LSTM-based neural network was designed to understand contextual flow and nuanced sentiment patterns.
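
To make the TF-IDF plus Multinomial Naive Bayes combination concrete, here is a dependency-free sketch. The project's references point to scikit-learn, whose `TfidfVectorizer` and `MultinomialNB` would normally fill these roles; whether the notebook uses exactly those classes is an assumption. The names `fit_idf`, `tfidf`, and `TinyMultinomialNB`, the smoothed IDF formula, and the toy sentences are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}


def tfidf(doc, idf):
    """TF-IDF weights for one tokenized document; unseen tokens are skipped."""
    counts = Counter(tok for tok in doc if tok in idf)
    return {tok: tf * idf[tok] for tok, tf in counts.items()}


class TinyMultinomialNB:
    """Multinomial Naive Bayes with Laplace smoothing over weighted token features."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, vectors, labels):
        self.vocab = sorted({tok for vec in vectors for tok in vec})
        class_counts = Counter(labels)
        self.log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
        # Accumulate the TF-IDF mass of each token per class.
        weight = {c: defaultdict(float) for c in class_counts}
        for vec, label in zip(vectors, labels):
            for tok, w in vec.items():
                weight[label][tok] += w
        self.log_lik = {}
        for c in class_counts:
            total = sum(weight[c].values()) + self.alpha * len(self.vocab)
            self.log_lik[c] = {
                tok: math.log((weight[c][tok] + self.alpha) / total) for tok in self.vocab
            }
        return self

    def predict(self, vec):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_lik[c][tok] for tok, w in vec.items() if tok in self.log_lik[c]
            )
        return max(self.log_prior, key=score)


# Toy code-mixed examples, invented for this sketch.
train_docs = [
    "makanan ni sedap gila best".split(),
    "service bagus very friendly staff".split(),
    "jam teruk so annoying lah".split(),
    "food teruk and service slow".split(),
]
train_labels = ["positive", "positive", "negative", "negative"]

idf = fit_idf(train_docs)
model = TinyMultinomialNB().fit([tfidf(d, idf) for d in train_docs], train_labels)
print(model.predict(tfidf("sedap and friendly".split(), idf)))
```

Multinomial Naive Bayes is defined for counts, but fractional TF-IDF weights are commonly used in their place, which is what the sketch does.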

Results

Model Evaluation Results

Multinomial Naive Bayes (MNB):
- Improved accuracy, recall, and F1 score post-tuning, though precision slightly decreased, indicating more false positives.
Support Vector Machine (SVM):
- No significant change after tuning, suggesting default parameters were optimal, showing stability in performance.
SVM on Truncated SVD features:
- Marginal changes in performance, indicating the model likely reached its peak with the given features.
Long Short-Term Memory (LSTM):
- Significant improvements in all metrics post-tuning, highlighting its strength in capturing temporal dependencies and reducing overfitting.
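
The trade-off noted for MNB (recall and F1 rising while precision falls) follows directly from how the metrics are defined. The sketch below computes them for a binary case; the function name `binary_metrics`, the binary label set, and the label sequences are assumptions made up for illustration and are not the project's results.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels: the "tuned" model finds every positive but adds one false positive.
y_true = ["positive"] * 4 + ["negative"] * 4
before = ["positive"] * 2 + ["negative"] * 6
after = ["positive"] * 5 + ["negative"] * 3

print(binary_metrics(y_true, before))
print(binary_metrics(y_true, after))
```

In this toy case the extra false positive lowers precision, while the recovered positives raise recall, F1, and accuracy.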

Best Model Configuration

Accuracy: LSTM (65%) slightly outperformed SVM.
Precision vs. Recall Balance: LSTM provides the best balance.
Complexity and Interpretability: SVM and MNB are simpler to interpret and quicker to train, ideal for scenarios requiring interpretability or efficiency.

For a more detailed explanation of these steps and results, refer to the full report: Report - Sentiment Analysis on Out-Of-Vocabulary (OOV) Malaysia Rojak Language.pdf.

Contributing

Contributions are welcome! Please fork this repository, make your changes in a new branch, and submit a pull request for review.

Fork the repo
Create a feature branch (git checkout -b feature-name)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin feature-name)
Create a new Pull Request

Acknowledgments

This project was developed in collaboration with naruto sun. We worked together on the sentiment analysis, model development, and project documentation.

License

This project is part of an academic course and is intended for educational purposes only. It may contain references to copyrighted materials, and the use of such materials is strictly for academic use. Please consult your instructor or institution for guidance on sharing or distributing this work.

For more details, see the LICENSE file.

Contact

Created by Wrrrrr - feel free to contact me!
For any inquiries, you can also reach out to naruto sun

References

Reddit Dataset: Reddit API
Twitter Rojak Dataset: Hugging Face - Mesolitica
Natural Language Processing: Malaya Library Documentation
Machine Learning Algorithms: Scikit-Learn Documentation
Evaluation Metrics: Accuracy, Precision, Recall, F1 Score
