This is a machine learning project focused on analyzing and classifying sentiments in code-switched and code-mixed text, specifically targeting the unique linguistic characteristics found in Malaysian conversations.
- Introduction
- Installation
- Usage
- Web Application
- GUI Application
- Methodology
- Results
- Contributing
- License
- Contact
- References
RojakLanguageSentimentAnalysis is a machine learning project designed to analyze and classify sentiments within code-switched and code-mixed text, specifically focusing on Malaysian linguistic patterns. This project tackles the unique challenges of multilingual sentiment analysis by employing both deep learning and traditional machine learning models, offering a comprehensive approach to understanding sentiments in a linguistically diverse context.
To set up the project environment, follow these steps:
- Clone the repository:
git clone https://github.com/Wei-RongRong2/RojakLanguageSentimentAnalysis
- Navigate to the project directory:
cd RojakLanguageSentimentAnalysis
- Install the required Python packages:
pip install -r requirements.txt
To run the sentiment analysis notebook, follow these steps:
- Ensure you have Jupyter Notebook installed. If not, you can install it using:
pip install notebook
- Navigate to the project directory where the Jupyter Notebook is located:
cd RojakLanguageSentimentAnalysis
- Launch Jupyter Notebook:
jupyter notebook
- In the Jupyter Notebook interface, open the RojakLanguageSentimentAnalysis.ipynb file.
- Run the cells in the notebook to execute the analysis.
The web_app folder contains the code for the web application. This includes:
- app.py: The Flask application used to serve the web application.
- templates/: Contains the HTML, CSS, and JavaScript files that define the front-end of the web application.
- static/: Contains static assets such as images.
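For orientation, app.py typically follows a small Flask pattern like the sketch below; the route, template name, and pickled model/vectorizer files here are illustrative assumptions, not the project's exact code.

```python
# Minimal Flask sketch (illustrative only): load a saved model and serve predictions.
import pickle

from flask import Flask, render_template, request

app = Flask(__name__)

# Hypothetical artifacts: a trained sentiment model and its vectorizer.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)


@app.route("/", methods=["GET", "POST"])
def index():
    prediction = None
    if request.method == "POST":
        text = request.form.get("text", "")
        features = vectorizer.transform([text])
        prediction = model.predict(features)[0]
    return render_template("index.html", prediction=prediction)


if __name__ == "__main__":
    app.run(debug=True)
```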
To run the web application locally:
- Navigate to the web_app directory:
cd web_app
- Install the required Python packages:
pip install -r requirements.txt
- Run the Flask application:
python app.py
This will start the web application locally, and you can access it by navigating to http://127.0.0.1:5000 in your web browser.
The web application is also deployed on Render and can be accessed online at Malaysian Rojak Language Sentiment Analysis.
In addition to the web application, you can also run a graphical user interface (GUI) application:
- Navigate to the web_app directory:
cd web_app
- Install tkinter if it's not already installed:
pip install tk
- Run the GUI application:
python GUI_model_deployment.py
This will launch a desktop application using Tkinter, providing an interface to make predictions.
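For a rough idea of what such an interface involves, here is a minimal, self-contained Tkinter sketch; the widget layout and the placeholder predict() function are illustrative assumptions rather than the actual GUI_model_deployment.py.

```python
# Minimal Tkinter sketch (illustrative): text box, predict button, result label.
import tkinter as tk


def predict() -> None:
    text = entry.get()
    # Placeholder: in the real application this would call the trained sentiment model.
    result_label.config(text=f"Predicted sentiment for: {text!r}")


root = tk.Tk()
root.title("Rojak Sentiment Analysis")

entry = tk.Entry(root, width=60)
entry.pack(padx=10, pady=5)

tk.Button(root, text="Predict", command=predict).pack(pady=5)

result_label = tk.Label(root, text="")
result_label.pack(padx=10, pady=5)

root.mainloop()
```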
This project focused on sentiment analysis of Malaysian code-switched and code-mixed text, using data from Reddit and Hugging Face. Key steps included:
- Source: A fusion of two datasets: a Reddit dataset from the Malaysia subreddit and a Twitter rojak dataset from mesolitica on Hugging Face.
- Reddit Dataset: Derived from the Malaysia subreddit using the Reddit API, capturing diverse discussions within the Malaysian community. The Malaya library was used to identify and collect Rojak-language posts.
- Twitter Rojak Dataset: Sourced from Hugging Face's language-detection-dataset, focusing on Twitter Rojak records.
- Data Cleaning: Removed duplicates, converted emojis to text, expanded contractions, and handled reduplicated words. Noise, including URLs and usernames, was removed while retaining punctuation for segmentation.
- Segmentation & Tokenization: Used Malaya HuggingFace for sentence segmentation and NLTK for tokenization, enabling precise analysis of code-switched text.
- Language Detection & Stemming: Detected language at the word level with Malaya’s FastText; applied stemming/lemmatization based on language.
- Normalization: Replaced abbreviations and removed redundant or non-standard words using regex rules.
- Named Entity Recognition (NER): Applied NER using Malaya and SpaCy, with challenges in code-mixed text.
- Data Splitting: Divided the data into training, validation, and test sets (70-15-15 ratio).
- Feature Extraction: Used TF-IDF, PCA (retaining 95% of variance), and Truncated SVD for dimensionality reduction (a rough sketch of the cleaning, splitting, and feature-extraction steps follows this list).
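The sketch below approximates these stages using scikit-learn and the emoji package; the actual notebook additionally relies on Malaya (segmentation, word-level language detection, normalization) and NLTK, and the toy data and parameters here are assumptions.

```python
# Illustrative preprocessing and feature-extraction sketch (not the project's exact pipeline).
import re

import emoji
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = [
    "best gila makanan kat sini, I love it",
    "service teruk betul, very disappointing la",
    "okay je, nothing special honestly",
    "sangat syok, highly recommended!",
    "harga mahal gila, not worth it",
    "boleh tahan, quite decent actually",
]
labels = ["positive", "negative", "neutral", "positive", "negative", "neutral"]


def clean(text: str) -> str:
    text = emoji.demojize(text)                     # convert emojis to text tokens
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # remove URLs and usernames
    return re.sub(r"\s+", " ", text).strip().lower()


cleaned = [clean(t) for t in texts]

# 70-15-15 split: hold out 30%, then divide it evenly into validation and test sets.
X_train, X_temp, y_train, y_temp = train_test_split(cleaned, labels, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# TF-IDF features, then TruncatedSVD for dimensionality reduction.
vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(X_train)
svd = TruncatedSVD(n_components=2, random_state=42)
train_reduced = svd.fit_transform(train_tfidf)
```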
The following models were then trained and compared (a minimal setup sketch follows this list):
- Multinomial Naive Bayes (MultinomialNB): Utilized for text classification due to its efficiency with count-based features like word frequencies. Trained on TF-IDF features to capture text patterns.
- Support Vector Machine (SVM): Chosen for its ability to handle high-dimensional text data and its versatility in finding the optimal hyperplane in the TF-IDF feature space. Also trained on TruncatedSVD-reduced features for enhanced speed and robustness.
- Long Short-Term Memory (LSTM): A deep learning model used to capture sequential and long-range dependencies in text data. The LSTM-based neural network was designed to understand contextual flow and nuanced sentiment patterns.
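Below is a minimal sketch of how the three models could be set up; the toy data, hyperparameters, and the Keras-based LSTM definition are illustrative assumptions rather than the project's actual training code.

```python
# Illustrative model setup: MultinomialNB and SVM on TF-IDF features, plus a compact LSTM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from tensorflow.keras import layers, models

texts = ["sedap gila this nasi lemak", "teruk betul the service", "okay la, nothing special"]
labels = [2, 0, 1]  # e.g. 0 = negative, 1 = neutral, 2 = positive

# Shared TF-IDF features for the classical models.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

mnb = MultinomialNB().fit(X, labels)       # efficient with count/frequency-style features
svm = SVC(kernel="linear").fit(X, labels)  # finds a separating hyperplane in TF-IDF space

# Compact LSTM in the spirit of the deep-learning model: embedding -> LSTM -> softmax.
lstm_model = models.Sequential([
    layers.Embedding(input_dim=5000, output_dim=128),
    layers.LSTM(64),
    layers.Dense(3, activation="softmax"),
])
lstm_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```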
After hyperparameter tuning, the models performed as follows:
- Multinomial Naive Bayes (MNB): Improved accuracy, recall, and F1 score post-tuning, though precision slightly decreased, indicating more false positives.
- Support Vector Machine (SVM): No significant change after tuning, suggesting the default parameters were already optimal and performance was stable.
- SVM with Truncated SVD features: Marginal changes in performance, indicating the model likely reached its peak with the given features.
- Long Short-Term Memory (LSTM): Significant improvements in all metrics post-tuning, highlighting its strength in capturing temporal dependencies and reducing overfitting.
Comparing the models overall (the metrics involved are illustrated in a short example after this list):
- Accuracy: LSTM (65%) slightly outperformed SVM.
- Precision vs. Recall Balance: LSTM provides the best balance.
- Complexity and Interpretability: SVM and MNB are simpler to interpret and quicker to train, ideal for scenarios requiring interpretability or efficiency.
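The comparison above is based on accuracy, precision, recall, and F1 score; as a hedged illustration, these can be computed with scikit-learn as follows (the labels and predictions are made up):

```python
# Example of computing the evaluation metrics used above (made-up predictions).
from sklearn.metrics import accuracy_score, classification_report

y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "neutral"]

print(accuracy_score(y_true, y_pred))                          # overall accuracy
print(classification_report(y_true, y_pred, zero_division=0))  # per-class precision, recall, F1
```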
For a more detailed explanation of these steps and results, refer to the full report: Report - Sentiment Analysis on Out-Of-Vocabulary (OOV) Malaysia Rojak Language.pdf.
Contributions are welcome! Please fork this repository, make your changes in a new branch, and submit a pull request for review.
- Fork the repo
- Create a feature branch (git checkout -b feature-name)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin feature-name)
- Create a new Pull Request
This project was developed in collaboration with naruto sun. We worked together on the sentiment analysis, model development, and project documentation.
This project is part of an academic course and is intended for educational purposes only. It may contain references to copyrighted materials, and the use of such materials is strictly for academic use. Please consult your instructor or institution for guidance on sharing or distributing this work.
For more details, see the LICENSE file.
Created by Wrrrrr - feel free to contact me!
For any inquiries, you can also reach out to naruto sun.
- Reddit Dataset: Reddit API
- Twitter Rojak Dataset: Hugging Face - Mesolitica
- Natural Language Processing: Malaya Library Documentation
- Machine Learning Algorithms: Scikit-Learn Documentation
- Evaluation Metrics: Accuracy, Precision, Recall, F1 Score