
The model is trained on a set of emails labelled as either Spam or Not Spam. There are 702 training emails, equally divided between the spam and non-spam categories. The model is then tested on 260 emails: it predicts the category of each email, and the predictions are compared with the known correct classifications to measure accuracy.


jisilvia/Naive_Bayes_Spam_Mail_Detector


Spam eMail Detection using Naive Bayes Classification Algorithm

In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve higher accuracy levels.
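
As a reminder of what the independence assumption buys, the standard naive Bayes decision rule can be written as follows. This is the textbook formulation, not something taken from this repository's notebook; here $y$ is the class (spam or non-spam) and $x_1, \dots, x_n$ are the features of one email. The second line is the Gaussian likelihood used by the Gaussian variant, with per-class mean $\mu_{y,i}$ and variance $\sigma_{y,i}^2$ for feature $i$.

```latex
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)

P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
                \exp\!\left( -\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}} \right)
```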

Project Description

In this project, a model is trained on a set of emails labelled as either Spam or Non-Spam. There are 702 training emails, equally divided between the spam and non-spam categories (351 each). Next, the model is tested on 260 emails: it predicts the category of each email, and the predictions are compared with the known correct classifications to measure accuracy. The data comes in two folders, train-mails and test-mails. The emails in train-mails are used to train the model; those in test-mails are used to test its accuracy. In each email file, the first line is the subject and the content starts from the third line.
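
The file layout described above (subject on the first line, content from the third line) could be read with a small helper like the one below. This is a minimal sketch, not the notebook's code: the function name `read_email`, the `latin-1` encoding, and the handling of a leading "Subject:" label are all assumptions made for illustration.

```python
def read_email(path):
    """Return (subject, body) for one email file.

    Assumes the layout described in this README: the subject is on
    line 1 and the body starts on line 3.
    """
    # latin-1 is an assumption; it decodes any byte sequence without errors.
    with open(path, encoding="latin-1") as f:
        lines = f.read().splitlines()

    subject = lines[0] if lines else ""
    # Assumption: drop a leading "Subject:" label if the file has one.
    if subject.lower().startswith("subject:"):
        subject = subject[len("subject:"):].strip()

    # Index 2 is the third line, where the content starts.
    body = "\n".join(lines[2:])
    return subject, body
```

Calling `read_email` on each file in train-mails would give the raw text that the later steps turn into features.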

Steps

  1. Cleaning and Preparing Data
  2. Building the Algorithms
  3. Training and Predicting Results
  4. Evaluation
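
The four steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the notebook's actual implementation: the helper names (`tokenize`, `build_vocabulary`, `extract_features`), the word-count features, the vocabulary size of 50, and the toy in-memory emails standing in for the train-mails and test-mails folders are all hypothetical. Only `Counter`, `GaussianNB`, and `accuracy_score` are taken from the imports this README lists.

```python
from collections import Counter

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB


def tokenize(text):
    """Step 1: lowercase, split on whitespace, keep alphabetic words longer than one character."""
    return [w for w in text.lower().split() if w.isalpha() and len(w) > 1]


def build_vocabulary(texts, size):
    """Step 2: keep the `size` most frequent words across the training texts."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return [word for word, _ in counts.most_common(size)]


def extract_features(texts, vocabulary):
    """Step 2: one row per text, one column per vocabulary word, values are word counts."""
    index = {word: col for col, word in enumerate(vocabulary)}
    features = np.zeros((len(texts), len(vocabulary)))
    for row, text in enumerate(texts):
        for word in tokenize(text):
            if word in index:
                features[row, index[word]] += 1
    return features


# Toy stand-ins for the real folders (assumption): 1 = spam, 0 = non-spam.
train_texts = [
    "win money now",
    "free money offer",
    "meeting agenda today",
    "project meeting notes",
]
train_labels = np.array([1, 1, 0, 0])
test_texts = ["win free money now", "project meeting agenda"]
test_labels = np.array([1, 0])

# The vocabulary size is an arbitrary illustrative choice.
vocabulary = build_vocabulary(train_texts, size=50)
X_train = extract_features(train_texts, vocabulary)
X_test = extract_features(test_texts, vocabulary)

# Step 3: train the classifier and predict the test labels.
model = GaussianNB()
model.fit(X_train, train_labels)
predictions = model.predict(X_test)

# Step 4: compare predictions with the known labels.
accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy: {accuracy:.2f}")
```

With the real data, the toy lists would be replaced by the text of the 702 training emails and the 260 test emails, and the labels would come from however the dataset marks spam.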

Requirements

Python. Python is an interpreted, high-level and general-purpose programming language.

Google Colab. Google Colab is a free, hosted Jupyter notebook environment.

Packages

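The code relies on NumPy and scikit-learn, which must be installed before running it (Colab ships with both). The os and collections modules are part of the Python standard library, and google.colab is available only inside Colab. The imports are: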

import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from google.colab import drive
drive.mount('/content/drive')

After running drive.mount('/content/drive'), follow the instructions in the output to authorize access to Google Drive so that the notebook can read the data directories.

Launch

Download the data file provided and decompress it. Using Google Drive, create the following folder structure and upload the data here:

/content/drive/MyDrive/MSBA_Colab_2020/ML_Algorithms/CA02/Data

where /content/drive/MyDrive is the root of your Google Drive once it is mounted at /content/drive.
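
A quick check that the upload landed in the right place might look like the sketch below. It assumes that the train-mails and test-mails folders sit directly inside the Data folder, which this README does not state explicitly; `count_files` is a hypothetical helper name.

```python
import os

DATA_DIR = "/content/drive/MyDrive/MSBA_Colab_2020/ML_Algorithms/CA02/Data"
# Assumption: the two folders are placed directly under Data.
TRAIN_DIR = os.path.join(DATA_DIR, "train-mails")
TEST_DIR = os.path.join(DATA_DIR, "test-mails")


def count_files(directory):
    """Return the number of regular files in `directory`, or None if it does not exist."""
    if not os.path.isdir(directory):
        return None
    return sum(
        os.path.isfile(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


for label, directory in [("train", TRAIN_DIR), ("test", TEST_DIR)]:
    print(label, directory, count_files(directory))
```

If the data is in place, the counts should match the 702 training emails and 260 test emails described in the project description; a value of None means the folder was not found.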

Known Bugs

Please download the .ipynb file and open it in Google Colab to correctly display the markup comments.

Authors

Silvia Ji - GitHub

License

This project is licensed under the MIT License.

Acknowledgements

The project template and dataset were provided by Arin Brahma at Loyola Marymount University.
