This project was built for the Skill4U Machine Learning program. It is a spam email classifier trained with the Gaussian Naïve Bayes (GaussianNB) algorithm.
Spamming is one of the most common attacks: it can compromise a large number of machines by delivering unwanted messages, viruses, and phishing attempts through email. We chose this project because many people now try to deceive recipients simply by sending fake emails.
According to recent figures, 40% of all email is spam, which amounts to about 15.4 billion emails per day and costs Internet users about $355 million per year. Automatic email filtering is currently the most effective way to deal with spam.
The proposed solution is a Gaussian Naïve Bayes classifier that assigns each email to one of two classes: spam or ham (not spam). GaussianNB assumes that the data for each label is drawn from a simple Gaussian distribution. We use the scikit-learn library to implement the Gaussian Naïve Bayes algorithm for classification.
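As a rough illustration of the idea above, here is a minimal sketch of fitting scikit-learn's `GaussianNB` on a few word-count vectors. The toy counts and the three example words are assumptions made up for this sketch; they are not taken from the project's dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical bag-of-words counts: each column is the number of times
# the words "free", "win" and "meeting" appear in one email.
X_train = np.array([
    [3, 2, 0],  # spam
    [2, 3, 0],  # spam
    [4, 1, 0],  # spam
    [0, 0, 2],  # not spam
    [0, 1, 3],  # not spam
    [1, 0, 2],  # not spam
])
y_train = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam, 0 = not spam

# GaussianNB estimates a mean and variance per feature and per class.
model = GaussianNB()
model.fit(X_train, y_train)

# Classify two unseen count vectors.
X_new = np.array([[3, 2, 0], [0, 0, 3]])
predictions = model.predict(X_new)
print(predictions)
```

The first new vector looks like the spam examples and the second like the ham examples, so the model labels them 1 and 0 respectively.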
We use the following technique to classify emails:
The dataset used to train our model was taken from Kaggle: https://www.kaggle.com/datasets/nitishabharathi/email-spam-dataset
- The dataset contains 3 CSV files, each with 2 columns.
- The first column is the body of the email.
- The second column contains the label: 0 for not spam, 1 for spam.
- The 3 files together contain 18,650 entries.
We cleaned the data using the NLTK library for Python and plain Python functions:
- Balanced the dataset
- Combined the 3 CSV files into 1 dataset
- Removed links from the body column
- Removed unnecessary symbols from the body column
- Converted all text to lower case
- Performed word tokenization
- Used lemmatization to reduce different forms of the same word to a common base form
- Removed stop words
- Vectorized the data with the bag-of-words method
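The text-cleaning steps above can be sketched with the Python standard library alone. This is a simplified stand-in, not the project's actual code: the stop-word set is a tiny hypothetical list (NLTK's English list is much longer), tokenization is a plain whitespace split, and lemmatization is left out because it needs NLTK's `WordNetLemmatizer` and its WordNet data. The helper names `clean_email` and `bag_of_words` are made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption, not NLTK's list).
STOP_WORDS = {"a", "an", "the", "is", "to", "you", "your", "at", "and", "of", "for"}

URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
NON_LETTER_PATTERN = re.compile(r"[^A-Za-z\s]")


def clean_email(body):
    """Return the list of cleaned tokens for one email body."""
    text = URL_PATTERN.sub(" ", body)          # remove links
    text = NON_LETTER_PATTERN.sub(" ", text)   # remove symbols and digits
    text = text.lower()                        # convert to lower case
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # drop stop words


def bag_of_words(token_lists):
    """Build a sorted vocabulary and one count vector per document."""
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


emails = [
    "WIN a FREE prize!!! Claim at http://example.com/prize now",
    "The meeting is moved to 3pm, see you at the office",
]
token_lists = [clean_email(e) for e in emails]
vocabulary, vectors = bag_of_words(token_lists)

print(vocabulary)
print(vectors)
```

Each email becomes a vector of word counts over the shared vocabulary, which is the numeric form a classifier such as GaussianNB expects.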
Algorithm comparison graph | Details |
---|---|
We use the GaussianNB algorithm for classification. We tested several classification algorithms, and GaussianNB gave the best results on the test data. |
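A comparison like this can be sketched as a loop over candidate estimators, each scored on the same held-out test split. The README does not list which algorithms were compared, so `LogisticRegression` and `DecisionTreeClassifier` below are placeholders, and the data is a synthetic stand-in generated with `make_classification` rather than the vectorized emails.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized emails (assumption).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Placeholder candidates; only GaussianNB is named in this README.
candidates = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    # score() returns the mean accuracy on the given test split.
    scores[name] = estimator.score(X_test, y_test)

# Print the candidates from best to worst test accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The ranking on synthetic data says nothing about the ranking on the real emails; the sketch only shows the shape of the comparison.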
ROC Curve | Model Evaluation |
---|---|
After training and finding the best parameters, we reached 90.07% accuracy on our test data. |
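The evaluation artifacts named in these tables (accuracy, ROC curve, confusion matrix, classification report) can all be computed with `sklearn.metrics`. The sketch below uses eight hand-written labels and scores as an assumption for illustration; it does not reproduce the 90.07% result.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

# Hypothetical ground truth, hard predictions and spam probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]   # 1 = spam, 0 = not spam
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.7, 0.4]

accuracy = accuracy_score(y_true, y_pred)   # fraction of correct predictions
matrix = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
auc = roc_auc_score(y_true, y_score)        # area under the ROC curve

print(f"accuracy: {accuracy:.4f}")
print(matrix)
print(f"ROC AUC: {auc:.4f}")
print(classification_report(y_true, y_pred, target_names=["not spam", "spam"]))
```

With these toy values, 6 of 8 predictions are correct, and the confusion matrix shows one false positive and one false negative.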
Confusion Matrix | Classification Report |
---|---|
By some estimates, about 14.5 billion spam emails are sent every day, almost 45 percent of the world's email traffic. Internet Service Providers (ISPs) use spam filters to avoid delivering harmful emails or links to recipients.
The demo on the left shows how the model works. You can also try it yourself by scanning the QR code below.
Demo | Scan to try it yourself |
---|---|