This project demonstrates how to classify emails as Spam or Ham (Not Spam) using Natural Language Processing (NLP) and a Random Forest Classifier.
- Preprocessing: Cleans and processes email text (removes punctuation, converts to lowercase, stems words, and removes stopwords).
- Vectorization: Converts text data into numerical format using CountVectorizer.
- Model Training: Uses a Random Forest Classifier for prediction.
- Prediction: Classifies new emails as Spam or Ham.
- Python 3.7 or higher
- Libraries:
  - numpy
  - pandas
  - nltk
  - scikit-learn
Install required libraries:
pip install numpy pandas nltk scikit-learn
The dataset used for this project:
- Columns:
  - text: The email content.
  - label_num: The label (0 for Ham, 1 for Spam).
Replace 'spam_ham_dataset.csv' with your dataset file.
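A quick check of the expected format before training might look like the sketch below. The helper name `check_columns` and the two inline sample rows are illustrative assumptions, not part of the project; in practice the DataFrame would come from `pd.read_csv('spam_ham_dataset.csv')`.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = {"text", "label_num"}


def check_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Raise ValueError if the frame lacks the expected columns or label values."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    bad_labels = set(df["label_num"].unique()) - {0, 1}
    if bad_labels:
        raise ValueError(f"unexpected label values: {sorted(bad_labels)}")
    return df


# Two made-up rows standing in for the real CSV file (hypothetical data).
sample_csv = io.StringIO(
    "text,label_num\n"
    '"Subject: meeting moved to 3pm",0\n'
    '"Subject: claim your free prize now",1\n'
)
data = check_columns(pd.read_csv(sample_csv))
print(data["label_num"].value_counts().to_dict())
```

If a dataset uses different column names, renaming them to `text` and `label_num` up front is likely simpler than changing the rest of the code.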
- Data Preprocessing:
  - Converts text to lowercase.
  - Removes punctuation.
  - Applies stemming to reduce words to their root forms.
  - Removes stopwords (e.g., "the", "is", "in").
- Feature Extraction:
  - Text is converted to a bag-of-words representation using CountVectorizer.
- Model Training:
  - Splits data into training and testing sets.
  - Trains a Random Forest Classifier on the training data.
- Email Prediction:
  - Takes an example email, preprocesses it, and predicts if it's Spam or Ham.
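The preprocessing and feature-extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the function name `preprocess`, the tiny `STOPWORDS` set, and the two sample emails are assumptions made so the example runs without downloading any nltk data. The project itself uses nltk's English stopword list.

```python
import string

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative stopword set (assumption); the project uses nltk's list,
# e.g. set(stopwords.words('english')) after nltk.download('stopwords').
STOPWORDS = {"the", "is", "in", "are", "a", "to", "your"}

stemmer = PorterStemmer()


def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, drop stopwords, and stem what remains."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [stemmer.stem(w) for w in text.split() if w not in STOPWORDS]
    return " ".join(words)


# Made-up emails for demonstration only.
emails = [
    "The emails are RUNNING!",
    "Claim your free prize in the lottery",
]
corpus = [preprocess(e) for e in emails]

# Bag-of-words: each column counts one vocabulary word, each row is one email.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(corpus)
print(X.shape)
```

The result `X` is a sparse matrix with one row per email and one column per distinct stemmed word, which is the input format the classifier is trained on.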
- Load the dataset:
data = pd.read_csv('spam_ham_dataset.csv')
- Run the code to train the model and evaluate accuracy:
cl.score(X_test, y_test)
- Predict an email:
prediction = cl.predict(x_email)
print(f"Prediction: {'Spam' if prediction[0] == 1 else 'Ham'}")
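The names `cl`, `X_test`, `y_test`, and `x_email` in the steps above come from the notebook. The sketch below shows one plausible way they could be produced; the toy corpus, the `test_size`, `n_estimators`, and `random_state` values are assumptions chosen only to keep the example small and repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up, already-preprocessed emails (1 = Spam, 0 = Ham).
corpus = [
    "win free prize claim now", "free cash offer click now",
    "cheap loan approv click", "claim free gift card now",
    "meet schedul monday morn", "project report attach review",
    "lunch tomorrow team offic", "pleas review attach invoic",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Hold out a quarter of the rows for testing, keeping both classes present.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

cl = RandomForestClassifier(n_estimators=50, random_state=42)
cl.fit(X_train, y_train)
print(f"Test accuracy: {cl.score(X_test, y_test):.2f}")

# Reuse the fitted vectorizer: transform(), not fit_transform(), so the new
# email is mapped onto the same vocabulary the model was trained on.
x_email = vectorizer.transform(["claim free prize now"])
prediction = cl.predict(x_email)
print(f"Prediction: {'Spam' if prediction[0] == 1 else 'Ham'}")
```

A new email should pass through the same preprocessing as the training data before it is vectorized; otherwise its words may not match the learned vocabulary.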
- Prints the model's prediction (Spam or Ham) for a sample email.
- Displays the actual label from the dataset for comparison.
- Ensure the dataset is in the correct format before running the notebook.
- The nltk library requires downloading stopwords: nltk.download('stopwords')
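One way to avoid repeating the download on every run is to check for the corpus first. The helper name `ensure_stopwords` below is hypothetical; `nltk.data.find` and `nltk.download` are standard nltk calls.

```python
import nltk


def ensure_stopwords() -> None:
    """Download the stopwords corpus only if it is not already installed."""
    try:
        nltk.data.find("corpora/stopwords")
    except LookupError:
        nltk.download("stopwords")
```

Calling `ensure_stopwords()` once at the top of the notebook, before the stopword list is first used, should be enough.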
Feel free to use and modify this project for learning purposes.