Identifying and distinguishing spam messages using the multinomial Naïve Bayes model.
In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. At the time of writing, scikit-learn provides five types of naive Bayes classifier, as follows:
- 1- Bernoulli Naive Bayes classifier
- 2- Categorical Naive Bayes classifier
- 3- Complement Naive Bayes classifier
- 4- Gaussian Naive Bayes classifier
- 5- Multinomial Naive Bayes classifier
In this repository, we use the multinomial Naive Bayes classifier to detect spam messages because it is simple to implement, achieves high accuracy, and works directly on vectorized text. Other approaches can also be used to detect spam messages, such as the Complement Naive Bayes classifier, or tf-idf features in place of raw word counts.
MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors $\theta_y = (\theta_{y1}, \ldots, \theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.
The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:
$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$

where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in a sample of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.
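To make the smoothed estimate concrete, here is a minimal sketch in plain Python. The three-word vocabulary, the counts, and the choice of α = 1 (Laplace smoothing) are assumptions made up for illustration; they are not taken from the notebook or the dataset.

```python
# Toy word counts per class. Both the vocabulary and the counts are
# hypothetical; they only illustrate the formula.
vocab = ["free", "win", "meeting"]
counts = {
    "spam": [3, 2, 0],  # N_yi for y = spam
    "ham": [1, 0, 4],   # N_yi for y = ham
}


def smoothed_theta(feature_counts, alpha=1.0):
    """Return (N_yi + alpha) / (N_y + alpha * n) for every feature i of one class."""
    n = len(feature_counts)      # number of features (vocabulary size)
    total = sum(feature_counts)  # N_y, total count of all features in the class
    return [(c + alpha) / (total + alpha * n) for c in feature_counts]


theta = {label: smoothed_theta(c) for label, c in counts.items()}
for label, probs in theta.items():
    print(label, dict(zip(vocab, probs)))
```

Note that "meeting" never occurs in the toy spam counts, yet its estimated probability is non-zero. Without smoothing, a single unseen word would force the whole class likelihood to zero.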
We used the SMS Spam Collection dataset to train the model. It can be accessed via the link below: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
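A possible way to load the data into a data frame is sketched below. It assumes the downloaded file is tab-separated with one `label<TAB>message` pair per line and no header row, and that pandas is available. The column names, the `is_spam` column, and the two sample lines are illustrative, not taken from the notebook.

```python
import csv
import io

import pandas as pd

# Two made-up lines in the assumed "label<TAB>message" layout.
sample = "ham\tSee you at lunch?\nspam\tWIN a FREE prize now\n"

df = pd.read_csv(
    io.StringIO(sample),         # replace with the path to the downloaded file
    sep="\t",
    header=None,                 # assumed: the file has no header row
    names=["label", "message"],  # hypothetical column names
    quoting=csv.QUOTE_NONE,      # keep quote characters inside messages as-is
)

# Encode the label as 1 for spam and 0 for ham.
df["is_spam"] = (df["label"] == "spam").astype(int)
print(df)
```

Disabling quote handling is a defensive choice: free-text messages may contain unbalanced quote characters that would otherwise confuse the parser.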
The scores of our multinomial Naïve Bayes model are:
- Accuracy: 99.01345291479821 %
- Precision: 97.88732394366197 %
- Recall: 94.5578231292517 %
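The relationship between these three scores can be sketched from raw prediction counts. The counts below are an inference: they are the smallest integers that reproduce the reported percentages (139/142, 139/147 and 1104/1115), and they are not read from the notebook, so treat them as an assumption.

```python
# Counts inferred from the reported percentages (an assumption, not notebook output).
tp = 139  # spam correctly flagged as spam
fp = 3    # ham wrongly flagged as spam
fn = 8    # spam missed and labelled ham
tn = 965  # ham correctly labelled ham

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of flagged messages that are spam
recall = tp / (tp + fn)                     # share of spam messages that were flagged

print(f"accuracy:  {accuracy * 100:.4f} %")
print(f"precision: {precision * 100:.4f} %")
print(f"recall:    {recall * 100:.4f} %")
```

Under these counts, the lower recall means missed spam (8 messages) is the more common error, while only 3 legitimate messages are wrongly flagged.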
We can use the confusion matrix to observe the performance of our model.

The notebook goes through the following steps:
- Import libraries
- Upload dataset
- Create the data frame
- Split the data
- Vectorize the data
- Train & predict
- Calculate accuracy, precision, and recall
- Calculate the confusion matrix
- Test the model with a new SMS/email message
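The steps listed above can be sketched end to end as follows. This is a minimal illustration assuming scikit-learn is installed, not the notebook's actual code. The twelve messages are made up and stand in for the data frame, and `classify` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up messages standing in for the real dataset; 1 = spam, 0 = ham.
messages = [
    "Win a free prize now", "Free entry to win cash", "Claim your free reward",
    "Urgent: you won a prize", "Cheap loans, apply now", "Win cash today, text now",
    "Are we still meeting today?", "See you at lunch", "Can you call me later?",
    "I'll be home by six", "Thanks for the notes", "Let's meet tomorrow morning",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split the data, keeping the spam/ham ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=0, stratify=labels
)

# Vectorize: learn the vocabulary on the training split only.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Train and predict.
model = MultinomialNB()
model.fit(X_train_counts, y_train)
y_pred = model.predict(X_test_counts)

# Accuracy, precision, recall, and the confusion matrix.
print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall:   ", recall_score(y_test, y_pred, zero_division=0))
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: true, columns: predicted
print(cm)


def classify(text):
    """Label a new SMS/email message using the fitted vectorizer and model."""
    prediction = model.predict(vectorizer.transform([text]))[0]
    return "spam" if prediction == 1 else "ham"


print(classify("Claim your free prize now"))
```

With so few messages the scores are not meaningful; the point is the order of operations, in particular fitting the vectorizer on the training split and reusing it unchanged for the test split and for new messages.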
More information is available in the Jupyter Notebook file.