This is coursework for the Big Data Analytics class. It uses a random forest classifier to identify whether a website is legitimate or a phishing site.
The aim was to learn as much as possible about supervised machine learning and, in the end, to create a Jupyter notebook on a topic of our choice (phishing detection in my case).
The coursework is in a Jupyter notebook (coursework_phishing_website_detection.ipynb) that is fully reproducible and explains my reasoning step by step.
There are several stages in this coursework:
- Research & Data Exploration
- Dataset presentation
- Related Work & Data Exploration
- Data Pre-processing
- Modelling / Classification
- Solution Improvement
Keywords:
- Random Forest Classification
- Gradient Boosted Trees
- Cross-validation
- Randomized Search
- Grid Search
- Fully Homomorphic Encryption Machine Learning
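To make a few of these keywords concrete, here is a minimal sketch of a random forest tuned with randomized search and 5-fold cross-validation in scikit-learn. It is not the notebook's code: the synthetic data from make_classification, the 30-feature shape, and the hyperparameter ranges are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the phishing dataset (assumption): 30 numeric
# features and a binary label (legitimate vs. phishing).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical search space; the notebook's actual ranges may differ.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4],
}

# Randomized search samples n_iter candidates from the space and scores
# each one with 5-fold cross-validation on the training split.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_accuracy = search.best_score_              # mean CV accuracy of the best candidate
test_accuracy = search.score(X_test, y_test)  # accuracy on the held-out split
print(best_params, round(cv_accuracy, 3), round(test_accuracy, 3))
```

A grid search would follow the same pattern with GridSearchCV and a param_grid argument, typically over a narrower range around the best randomized-search result.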
As a bonus, I created a Streamlit application to simulate a real-world implementation of an anti-phishing solution based on machine learning.
Note: To run the Streamlit app, which determines whether a website is phishing or legitimate based on its URL, use the following command:
streamlit run phishing_website_detection_app.py
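Since the app classifies a website from its URL, the URL has to be turned into numeric or boolean features before a model can score it. The sketch below shows one way that step could look using only the standard library. The function name extract_url_features and the specific features are hypothetical examples, not necessarily the ones used in phishing_website_detection_app.py.

```python
import re
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Derive a few simple, illustrative features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        # Very long URLs are often used to hide the real destination.
        "url_length": len(url),
        # Whether the scheme is HTTPS.
        "uses_https": parsed.scheme == "https",
        # An "@" anywhere in the URL can be used to obscure the real host.
        "has_at_symbol": "@" in url,
        # A raw IPv4 address in place of a domain name.
        "host_is_ipv4": re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host) is not None,
        # Number of dots in the host, a rough proxy for subdomain depth.
        "dot_count_in_host": host.count("."),
    }


features = extract_url_features("http://192.168.0.1/login@secure")
print(features)
```

A feature dictionary like this could then be converted to a row vector, in the same column order used at training time, and passed to the trained classifier's predict method.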