The proposed system detects spam emails using machine learning algorithms, aiming for high accuracy and good performance. It saves the user's time and reduces the risk posed by spam.
Email is a popular and preferred means of written communication in everyday life. Its main problem is spam. Over the past decade, unsolicited bulk email has become a major problem for email users, and a huge amount of spam flows into users' mailboxes every day.
The volume of spam grows day by day, causing many important emails to be lost in a sea of junk mail. To mitigate this, we implement ways to distinguish spam from important emails.
Doing so reduces the time spent looking for an important email, and with it the hassle of the process. Our goal is to filter as accurately as possible, separating spam from ham (legitimate email).
The main feature of our project is determining whether a received email is spam or ham. This is useful for students and working professionals who deal with email every day. The project also aims to help prevent phishing attempts by filtering spam from ham.
A. Pre-processing
- Removal of Special Characters
- Removal of Numbers
- Lowercase Conversion
- Tokenization
- Removal of Stop words
- Stemming
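The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the names `STOP_WORDS`, `naive_stem`, and `preprocess` are hypothetical, the stop-word list is deliberately tiny, and the suffix-stripping function is only a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in",
              "you", "your", "for"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply the listed steps in order and return a list of stemmed tokens."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # removal of special characters
    text = re.sub(r"\d+", " ", text)              # removal of numbers
    text = text.lower()                           # lowercase conversion
    tokens = text.split()                         # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [naive_stem(t) for t in tokens]        # stemming


tokens = preprocess("WINNER!! You have won 1000 dollars, claim your prize now.")
print(tokens)
```

In practice a library stemmer and a full stop-word list would likely replace the two hand-written pieces; the order of the steps is what the sketch is meant to show.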
B. Feature Extraction
- Bag of words
- Tf-Idf
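To make the two feature representations concrete, here is a small standard-library sketch. It assumes the textbook weighting tf = count / document length and idf = log(N / df); libraries often use smoothed or normalized variants, so the numbers will not necessarily match what the project produces. The function names `bag_of_words` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """docs is a list of token lists. Returns (vocabulary, count vectors)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Weight each count by term frequency times inverse document frequency."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weighted = []
    for row in counts:
        total = sum(row) or 1  # guard against empty documents
        weighted.append([(c / total) * w for c, w in zip(row, idf)])
    return vocab, weighted


docs = [["win", "prize", "win"], ["meeting", "prize"]]
vocab, counts = bag_of_words(docs)
_, weights = tf_idf(docs)
print(vocab, counts, weights)
```

Note that a term appearing in every document (here `prize`) receives weight zero under this unsmoothed idf, which is the intended effect: words common to all emails carry no signal for separating spam from ham.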
C. Classification
- Naive Bayes Algorithm (also implemented in C++)
- Random Forest Classifier
- Support Vector Machine
- MLP Classifier
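As an illustration of the first classifier in the list, the following is a minimal multinomial Naive Bayes sketch with Laplace smoothing, written against the standard library only. It is an assumption about how such a classifier might look, not the project's implementation; the class name `NaiveBayes`, the `alpha` parameter, and the toy training data are all made up for the example. The other three classifiers are not sketched here.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes over token lists (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed default)

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.total_words = {c: sum(self.word_counts[c].values())
                            for c in self.classes}
        self.n_docs = len(docs)
        return self

    def _log_score(self, doc, c):
        # log P(c) + sum over tokens of log P(token | c), with smoothing.
        score = math.log(self.class_counts[c] / self.n_docs)
        denom = self.total_words[c] + self.alpha * len(self.vocab)
        for tok in doc:
            if tok in self.vocab:  # ignore words never seen in training
                score += math.log((self.word_counts[c][tok] + self.alpha) / denom)
        return score

    def predict(self, doc):
        return max(self.classes, key=lambda c: self._log_score(doc, c))


# Toy data, invented for the example.
train_docs = [["win", "prize", "now"], ["free", "prize", "claim"],
              ["meeting", "tomorrow", "agenda"], ["project", "report", "meeting"]]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["free", "prize"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word unseen in one class from driving that class's probability to zero.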
Note: The datasets created in our project have been uploaded here: Datasets