This project is divided into two main parts:
- Text Classification:
  - Implemented using a Rule-Based Classifier.
  - Implemented using a Bag of Words Classifier.
- Word Embeddings:
  - Implemented using the Skip-gram model of Word2Vec.
A rule-based classifier uses a set of manually crafted rules to classify text data. This approach relies on domain knowledge and specific patterns identified in the text to make predictions.
The Bag of Words (BoW) model is a simple and commonly used method in natural language processing. It transforms text into a fixed-size vector by counting the frequency of each word in the document, ignoring grammar and word order but keeping multiplicity.
- Preprocessing (sketched in code below):
  - Tokenization: Splitting the text into individual words.
  - Lowercasing: Converting all text to lowercase for uniformity.
  - Stop-word removal: Dropping common words that contribute little to the meaning.
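A minimal sketch of these preprocessing steps using nltk (one of the listed libraries); the tokenizer and the English stop-word list are reasonable defaults, not necessarily the project's exact choices:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer data and stop-word list
# ("punkt_tab" is only needed on newer NLTK versions).
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")

def preprocess(text):
    tokens = word_tokenize(text)              # tokenization
    tokens = [t.lower() for t in tokens]      # lowercasing
    stop_words = set(stopwords.words("english"))
    # Stop-word (and punctuation) removal.
    return [t for t in tokens if t.isalpha() and t not in stop_words]

print(preprocess("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```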
- Rule-Based Classifier (illustrated below):
  - Define rules based on domain knowledge.
  - Apply these rules to classify the text.
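For illustration, a tiny keyword-based sentiment classifier; the keyword sets and labels below are hypothetical stand-ins for real domain rules:

```python
# Hypothetical hand-crafted rules derived from domain knowledge.
POSITIVE_WORDS = {"great", "excellent", "love", "wonderful"}
NEGATIVE_WORDS = {"terrible", "awful", "hate", "poor"}

def rule_based_classify(tokens):
    # Apply the rules: count how many tokens match each keyword set.
    pos = sum(t in POSITIVE_WORDS for t in tokens)
    neg = sum(t in NEGATIVE_WORDS for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # no rule fires decisively

print(rule_based_classify(["love", "great", "movie"]))  # positive
```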
- Bag of Words Classifier (see the sketch after this list):
  - Create a vocabulary of all unique words in the training dataset.
  - Convert each document into a vector based on word frequency.
  - Train a machine learning model (e.g., Naive Bayes, Logistic Regression) on these vectors.
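A minimal end-to-end sketch with scikit-learn; the toy corpus and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical; replace with the real dataset).
docs = ["I love this movie", "what a great film", "terrible acting", "I hate it"]
labels = ["positive", "positive", "negative", "negative"]

# Build the vocabulary and turn each document into a word-count vector.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)

# Train a Naive Bayes classifier on the count vectors.
clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["a great movie"])))  # ['positive']
```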
Word2Vec is a popular technique for learning word embeddings, i.e., dense vector representations of words. It rests on the idea that a word's meaning is determined by the contexts in which it occurs, so words with similar meanings end up with similar representations.
- Continuous Bag of Words (CBOW):
  - Predicts the center word from the surrounding context words.
- Skip-gram:
  - Predicts the surrounding context words from the center word.
  - Implemented in this project.
The Skip-gram model is designed to predict the context words for a given center word. It learns word representations by maximizing the probability of the context words given a center word.
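Concretely, for a token sequence $w_1, \dots, w_T$ and window size $m$, the standard Skip-gram objective (Mikolov et al., 2013) maximizes the average log-probability of context words, where $v_w$ and $v'_w$ are the input and output embeddings and $V$ is the vocabulary size:

$$
\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-m \le j \le m,\ j \ne 0} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\big({v'_{w_O}}^{\top} v_{w_I}\big)}{\sum_{w=1}^{V} \exp\big({v'_{w}}^{\top} v_{w_I}\big)}
$$

Because the softmax sums over the entire vocabulary, practical implementations approximate it with negative sampling or hierarchical softmax.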
- Preprocessing:
  - Tokenization: Splitting the text into individual words.
  - Building the vocabulary: Creating a dictionary of all unique words in the dataset.
  - Generating training examples: Creating pairs of center words and context words, as sketched below.
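A minimal sketch of pair generation with a symmetric context window (the window size here is an arbitrary example):

```python
def skipgram_pairs(tokens, window=2):
    # Pair each center word with every word at most `window`
    # positions away on either side.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```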
- Training the Model:
  - Define the neural network architecture for the Skip-gram model.
  - Train the model using the generated training examples.
  - Optimize the model parameters to learn the word embeddings (a gensim-based sketch follows).
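Since gensim is among the listed libraries, one way to train Skip-gram embeddings is shown below (gensim 4.x API; `sg=1` selects Skip-gram, and all hyperparameters are illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (replace with the real data).
sentences = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog", "sleeps"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the embeddings
    window=2,        # context window size
    min_count=1,     # keep rare words in this tiny corpus
    sg=1,            # 1 = Skip-gram, 0 = CBOW
    epochs=10,
)

# Look up a learned embedding and its nearest neighbours.
print(model.wv["fox"].shape)                 # (50,)
print(model.wv.most_similar("fox", topn=3))
```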
- Python 3.x
- Libraries: numpy, pandas, scikit-learn, gensim, nltk
- Thanks to the developers of the libraries used in this project.
- Special thanks to the authors of the datasets used for training and evaluation.
Feel free to explore the code and experiment with different configurations to improve the models! If you encounter any issues or have suggestions, please open an issue or submit a pull request.
Happy coding!