microsoft-malware

Binomial classification algorithm to predict whether Windows devices contain malware

About The Project

The data and business problem associated with this project came from a Kaggle Contest from 2019
Between train and test sets, the data contains 17.6 million observations
Uses international user data across various Windows devices including phones, tablets, PCs, etc.
Binomial classification using logistic regression, support vector machines (SVM), naive Bayes (NB), random forest (RF), gradient boosting (GB), and ensemble solutions combining multiple algorithms' predictions
The highest-performing logistic regression submission attained a ROC score of 0.67585, whereas my final logistic model (within the private leaderboard) had a ROC of 0.67667
This put the solution in a position to qualify for prize money, placing 1st among almost 2,500 teams of professional data scientists
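To make the scoring and ensembling ideas above concrete, here is a minimal sketch in base R. It assumes the ROC score refers to the area under the ROC curve, computes it with the rank-based (Mann-Whitney) formula, and combines two models by a plain unweighted average. The toy labels, the prediction vectors, and the `auc` helper are invented for illustration and are not taken from the repo's scripts.

```r
# Rank-based (Mann-Whitney) estimate of the area under the ROC curve.
# `labels` is a 0/1 vector; `scores` holds predicted probabilities.
auc <- function(labels, scores) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r     <- rank(scores)  # average ranks are used for tied scores
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data, invented for illustration only.
labels  <- c(0, 0, 1, 1, 0, 1)
pred_lr <- c(0.20, 0.60, 0.55, 0.90, 0.30, 0.40)  # hypothetical logistic regression output
pred_rf <- c(0.10, 0.35, 0.80, 0.70, 0.50, 0.65)  # hypothetical random forest output

# Simple ensemble: unweighted mean of the two models' probabilities.
pred_ens <- (pred_lr + pred_rf) / 2

auc_lr  <- auc(labels, pred_lr)
auc_rf  <- auc(labels, pred_rf)
auc_ens <- auc(labels, pred_ens)
```

An unweighted mean is only one way to combine predictions; weighted averages or stacking are common alternatives, and which one the final submission used is not stated here.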

Given the performance limitations of the R Server (discussed in greater detail in the next section), the training and test pre-processing are done in separate scripts. The methodology and process are almost identical in each
In the cleaning script's second section, the categorical features are converted to numeric values, ordered by the percentage of malware in each group. Each of these sections was commented out and should have been written as a reusable function, so please uncomment them before running the cleaning script
Many of the features in the dataset are inherently unintuitive to someone without a working knowledge of computer hardware specifications. For a more complete data dictionary, please consult the dictionary I created and included in the repo, as the Kaggle dictionary does not explain several key features
This script was written in the spring of my junior year of undergrad, and given the many performance limitations, efficiency of server computing was prioritized over scripting efficiency. While I hope to revisit and revise the scripts in the future, there are many areas for improvement in terms of the code's elegance
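The categorical-to-numeric conversion described above could be wrapped in a reusable function along these lines. This is a sketch under assumptions, not the code in the cleaning scripts: the function name `encode_by_target_rate`, the column names, and the toy data are all hypothetical. Each level is replaced by its rank once the levels are sorted by malware rate.

```r
# Replace each level of a categorical vector with its rank when levels
# are ordered by the mean of a 0/1 target (lowest malware rate first).
encode_by_target_rate <- function(x, target) {
  rates   <- vapply(split(target, x), mean, numeric(1))  # malware rate per level
  ordered <- names(sort(rates))                          # levels, lowest rate first
  mapping <- setNames(seq_along(ordered), ordered)
  list(values = unname(mapping[as.character(x)]), mapping = mapping)
}

# Toy training data; column names and values are invented.
train <- data.frame(
  os_edition  = c("Home", "Pro", "Home", "Enterprise", "Pro", "Home"),
  has_malware = c(1, 0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)
enc <- encode_by_target_rate(train$os_edition, train$has_malware)
train$os_edition_num <- enc$values

# Reuse the training mapping on test data; levels never seen in training become NA.
test_levels <- c("Pro", "Home", "Education")
test_num    <- unname(enc$mapping[test_levels])
```

Returning the mapping alongside the encoded values lets the test script apply exactly the ordering learned from the training set, which matters because the test set has no labels to compute rates from.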

The data for this project is 5 GB in total, so large that attempting to open any of the individual files will crash either Excel or the RStudio environment
Even once the data has been read into RStudio, modeling it requires so much memory (my initial attempts at neural network development for this problem hit the server's 55 GB maximum) that a distributed system is needed to process it
Since this solution was conceptualized in R, an R Server was used to model the data. The attached R Server Best Practices PDF compiles the insights gained throughout this project and can provide a helpful launching point for other R developers
If an R Server instance is not available, the included desktop_subset CSV can be helpful for those interested in tackling the data complexity and cleaning aspects of this problem
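For readers working on a desktop, one way to keep memory use down when reading a large CSV is to preview a few rows first and then declare column types up front, skipping columns that are not needed. The sketch below uses only base R's `read.csv`. A small temporary file stands in for the real data so the example runs anywhere, and its three column names are assumptions about the competition files rather than something confirmed by this README.

```r
# Stand-in file so the sketch is runnable; replace `path` with the real CSV.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "MachineIdentifier,Wdft_IsGamer,HasDetections",
  "a1,0,1",
  "b2,1,0",
  "c3,0,0",
  "d4,1,1"
), path)

# Peek at the header and the first rows without loading the whole file.
preview <- read.csv(path, nrows = 2, stringsAsFactors = FALSE)

# Declare column types so R does not have to guess them, and use the
# special class "NULL" to skip a column entirely.
col_types <- c(MachineIdentifier = "character",
               Wdft_IsGamer      = "integer",
               HasDetections     = "NULL")
slim <- read.csv(path, colClasses = col_types)
```

Packages such as data.table provide faster readers (`fread`) that may be worth trying on the full files, though whether they fit within a given machine's memory would need to be checked.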

There was a direct relationship between the number of anti-virus (AV) products installed and the percentage of users with malware on their devices, presumably because users only need 1 AV product, and the more a user has installed, the more likely it is that they do not know how the products should be used
Almost 80% of individuals whose device contained a touch screen that had not been configured (ExistsNotSet) had malware. This was similarly inferred to be because the user lacked the technological acumen to understand the full extent of their device's capabilities
Gamers (denoted by the Wdft_IsGamer boolean feature) had a statistically significantly lower infection rate than other users. This is likely due to the technology knowledge required to be a consistent gamer, as well as the higher cost of gaming technology
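Group comparisons like the ones above can be reproduced with a per-group infection rate and a two-sample test of proportions. The sketch below uses synthetic counts invented purely for illustration, so its numbers say nothing about the real data. `Wdft_IsGamer` comes from the text above, while `HasDetections` is assumed to be the name of the 0/1 target column. The test actually used for the original finding is not stated, so `prop.test` is a stand-in.

```r
# Synthetic data, invented for illustration; not the competition data.
devices <- data.frame(
  Wdft_IsGamer  = rep(c(0, 1), times = c(600, 400)),
  HasDetections = c(rep(c(1, 0), times = c(330, 270)),  # non-gamers
                    rep(c(1, 0), times = c(180, 220)))  # gamers
)

# Infection rate per group.
rates <- aggregate(HasDetections ~ Wdft_IsGamer, data = devices, FUN = mean)

# Two-sample test of equal proportions on infected counts per group.
infected <- tapply(devices$HasDetections, devices$Wdft_IsGamer, sum)
totals   <- tapply(devices$HasDetections, devices$Wdft_IsGamer, length)
gamer_test <- prop.test(x = as.vector(infected), n = as.vector(totals))

print(rates)
print(gamer_test$p.value)
```

The same pattern applies to the AV product count or the touch screen setting by swapping the grouping column, keeping in mind that a significant difference in rates does not by itself establish the causal explanations suggested above.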

Download the data from Kaggle
Read the data into RStudio using an R Server environment, consulting the best practices PDF if questions arise
Follow the steps listed in training_clean.R, outputting the cleaned data as a separate data file
Repeat the same steps with the test_clean.R script
Follow the steps listed in modeling.R
Submit predictions to Kaggle (the script is formatted to produce the output in the style needed for Kaggle's scoring algorithm, but a sample submission is also available on the competition page for reference)
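As a rough illustration of the final step, the sketch below writes a two-column submission file with base R. The column names `MachineIdentifier` and `HasDetections` are assumptions about the competition's expected layout and should be checked against the sample submission on the competition page. The IDs and probabilities are placeholders, not real model output, and modeling.R may format its output differently.

```r
# Hypothetical IDs and predicted probabilities standing in for model output.
submission <- data.frame(
  MachineIdentifier = c("id_001", "id_002", "id_003"),
  HasDetections     = c(0.12, 0.87, 0.50),
  stringsAsFactors  = FALSE
)

# Write without row names so the file contains only the two columns.
out_path <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_path, row.names = FALSE, quote = FALSE)

# Read the file back to confirm the layout before uploading.
check <- read.csv(out_path, stringsAsFactors = FALSE)
```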