Our project addresses the problem of spam text messages. Spam texts have become very common, so our goal is to filter them out for the benefit of the user. Our implementation will use classification to determine whether a given message is “spam” or “not spam”. We plan to try classification with logistic regression and SVM; whether to include neural networks is still to be decided. We will show how these algorithms are used in the context of NLP and spam message filtering, discuss which parameters each one has and how to tune them, and finally compare the tuned models’ prediction accuracies to identify which model performs best. Using what we have learned in class, we will discuss our observations and why they make sense (or why they don’t).
- K-nearest neighbors
- Logistic Regression
- SVM
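As a rough illustration of the classification setup described above, here is a minimal pure-Python sketch of one of the listed candidates, logistic regression, over bag-of-words counts. It is not the project's implementation: the toy messages, the function names, and the `epochs` and `learning_rate` defaults are all made up for illustration, and a real experiment would more likely rely on an existing library and the full dataset.

```python
import math
import re
from collections import Counter

# Hypothetical toy data for illustration only (1 = spam, 0 = not spam).
TOY_MESSAGES = [
    "WINNER! Claim your free prize now",
    "Free entry to win cash, text WIN to 80000",
    "Urgent: your account won a free voucher, reply now",
    "Are we still meeting for lunch today?",
    "I'll call you when I get home",
    "Can you send me the notes from class?",
]
TOY_LABELS = [1, 1, 1, 0, 0, 0]


def tokenize(text):
    """Lowercase a message and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def spam_probability(text, weights, bias):
    """Return the model's estimated probability that `text` is spam."""
    counts = Counter(tokenize(text))
    score = bias + sum(weights[word] * n for word, n in counts.items())
    score = max(-30.0, min(30.0, score))  # keep math.exp in a safe range
    return 1.0 / (1.0 + math.exp(-score))


def train_logistic_regression(messages, labels, epochs=200, learning_rate=0.5):
    """Fit bag-of-words logistic regression with stochastic gradient descent.

    `epochs` and `learning_rate` are the tunable parameters of this sketch;
    the defaults are arbitrary assumptions, not recommended values.
    """
    weights = Counter()  # one weight per word; unseen words count as 0
    bias = 0.0
    for _ in range(epochs):
        for text, label in zip(messages, labels):
            error = label - spam_probability(text, weights, bias)
            bias += learning_rate * error
            for word, n in Counter(tokenize(text)).items():
                weights[word] += learning_rate * error * n
    return weights, bias


def accuracy(messages, labels, weights, bias):
    """Fraction of messages whose thresholded prediction matches the label."""
    correct = sum(
        int(spam_probability(text, weights, bias) > 0.5) == label
        for text, label in zip(messages, labels)
    )
    return correct / len(messages)


weights, bias = train_logistic_regression(TOY_MESSAGES, TOY_LABELS)
print("training accuracy:", accuracy(TOY_MESSAGES, TOY_LABELS, weights, bias))
print("P(spam):", spam_probability("Claim your free cash prize", weights, bias))
```

The same `accuracy` helper could be reused to compare the other candidate models, provided each exposes a comparable prediction function and is scored on held-out messages rather than on the training set used here.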
- Read reading list material (4/19/2017)
- Choose implementation (4/20/2017)
- Build the implementation (adapting a strong existing online implementation to fit our data) (4/28/2017)
- Write analysis of implementation (5/3/2017)
- Sebastiani, Fabrizio. "Machine learning in automated text categorization." ACM computing surveys (CSUR) 34.1 (2002): 1-47.
- Guzella, Thiago S., and Walmir M. Caminhas. "A review of machine learning approaches to spam filtering." Expert Systems with Applications 36.7 (2009): 10206-10222.
- Blanzieri, Enrico, and Anton Bryl. "A survey of learning-based techniques of email spam filtering." Artificial Intelligence Review 29.1 (2008): 63-92.
Naive Bayes lecture slides (Stanford CS124): https://web.stanford.edu/class/cs124/lec/naivebayes.pdf
Extremely short summary of Naive Bayes: https://stats.stackexchange.com/questions/91177/machine-learning-techniques-for-spam-detection-and-in-general-for-text-classifi
Dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset
TensorFlow email phishing tutorial: https://jrmeyer.github.io/tutorial/2016/02/01/TensorFlow-Tutorial.html
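
For the dataset linked above, a loading step might look like the following sketch. It assumes the downloaded CSV has a label column named `v1` (with values `ham` or `spam`) and a text column named `v2`; these column names, the helper name `load_sms_messages`, and the in-memory sample rows are assumptions for illustration and should be checked against the actual file.

```python
import csv
import io


def load_sms_messages(file_obj, label_column="v1", text_column="v2"):
    """Read labeled SMS rows from an open CSV file object.

    Returns (messages, labels) with labels encoded as 1 = spam, 0 = ham.
    The default column names are assumptions about the dataset's layout.
    """
    messages, labels = [], []
    for row in csv.DictReader(file_obj):
        label = (row.get(label_column) or "").strip().lower()
        text = (row.get(text_column) or "").strip()
        if label not in ("spam", "ham") or not text:
            continue  # skip rows that are malformed or unlabeled
        messages.append(text)
        labels.append(1 if label == "spam" else 0)
    return messages, labels


# Hypothetical in-memory sample standing in for the real file.
SAMPLE_CSV = 'v1,v2\nham,"Sure, see you at 5"\nspam,Free prize! Reply WIN now\n'

sample_messages, sample_labels = load_sms_messages(io.StringIO(SAMPLE_CSV))
print(sample_messages)
print(sample_labels)
```

With the real file, the same function could be called on a handle opened with `open(path, newline="", encoding=...)`; the right encoding depends on how the file was exported, so it is left unspecified here.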