Machine learning project that predicts spam emails from keyword features in the UCI Spambase dataset: https://archive.ics.uci.edu/ml/datasets/spambase
One application of these results is company promotional e-mail, where the goal is to avoid being misclassified as spam. Using the machine learning concepts I apply in this repository, a company with this dataset could identify which features matter most when an e-mail is classified as spam, and take them into account when crafting the advertisements and e-mails it sends to its customer base.
- Load and split dataset at random into 80% training and 20% testing
- Draw a bootstrap sample of n=1000 rows, vary the number of columns (p), and choose the value of p that results in the lowest cross-validation error
- After determining p, the number of columns, I will train a decision tree classifier on the bootstrap sample
- Repeat these calculations to generate T ∈ {1, 50, 100, 150, 200, 300, 400} trees and evaluate on the training set.
- Determine train and test error, F1 score, and AUC by varying T in the range {1, 50, 100, 150, 200, 300, 400}
- Train a Random Forest algorithm with 10, 50, and 100 decision trees and report similar metrics on both the training and testing sets
- Report/visualize the top 10 features having the most influence on the model
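
The steps above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the code in this repository. Several things here are assumptions: a synthetic dataset from `make_classification` stands in for Spambase (57 features) so the example runs without a download, the helper names (`choose_p`, `fit_bagged_trees`, `spam_vote_fraction`) and the candidate values of p are hypothetical, each tree is given its own random subset of p columns, and the grid of T is shortened to keep the run quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for Spambase; swap in the real data.
X, y = make_classification(n_samples=2000, n_features=57,
                           n_informative=15, random_state=0)

# Random 80% / 20% train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

P_CANDIDATES = (5, 10, 20, 40)  # hypothetical grid for p
T_VALUES = (1, 50, 100)         # shortened; the list above goes up to 400


def choose_p(X, y, candidates, n_boot=1000, seed=0):
    """Return the p with the lowest 5-fold CV error on one bootstrap sample."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, X.shape[0], size=n_boot)  # rows with replacement
    best_p, best_err = None, np.inf
    for p in candidates:
        cols = rng.choice(X.shape[1], size=p, replace=False)
        acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                              X[np.ix_(rows, cols)], y[rows], cv=5).mean()
        if 1.0 - acc < best_err:
            best_p, best_err = p, 1.0 - acc
    return best_p


def fit_bagged_trees(X, y, n_trees, p, n_boot=1000, seed=0):
    """Fit n_trees trees, each on a bootstrap sample and p random columns."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_trees):
        rows = rng.integers(0, X.shape[0], size=n_boot)
        cols = rng.choice(X.shape[1], size=p, replace=False)
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        ensemble.append((tree, cols))
    return ensemble


def spam_vote_fraction(ensemble, X):
    """Fraction of trees voting for class 1, used as a score for AUC."""
    return np.mean([tree.predict(X[:, cols]) for tree, cols in ensemble],
                   axis=0)


p = choose_p(X_train, y_train, P_CANDIDATES)

# Error, F1, and AUC on both splits for each ensemble size T.
results = {}
for T in T_VALUES:
    ensemble = fit_bagged_trees(X_train, y_train, T, p)
    for split, Xs, ys in (("train", X_train, y_train),
                          ("test", X_test, y_test)):
        score = spam_vote_fraction(ensemble, Xs)
        pred = (score >= 0.5).astype(int)  # majority vote
        results[(T, split)] = {
            "error": float(np.mean(pred != ys)),
            "f1": float(f1_score(ys, pred)),
            "auc": float(roc_auc_score(ys, score)),
        }
        print(T, split, results[(T, split)])

# Library random forest with 10, 50, and 100 trees for comparison.
for n in (10, 50, 100):
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    rf.fit(X_train, y_train)
    print(n, "test error:", 1.0 - rf.score(X_test, y_test))

# Indices of the 10 most important features of the last forest (100 trees).
top10 = np.argsort(rf.feature_importances_)[::-1][:10]
print("top 10 feature indices:", top10)
```

With the real dataset, the indices in `top10` would be mapped back to the Spambase column names to see which keywords drive the classification.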