Binomial classification algorithm to predict whether Windows devices contain malware
- The data and business problem associated with this project came from a Kaggle Contest from 2019
- Between train and test sets, the data contains 17.6 million observations
- Uses international user data across various Windows devices including phones, tablets, PCs, etc.
- Binomial classification using logistic regression, support vector machines (SVM), naive Bayes (NB), random forest (RF), gradient boosting (GB), and ensemble solutions combining multiple algorithms' predictions
- The highest-performing logistic regression submission attained a ROC AUC score of 0.67585, whereas my final logistic model (within the private leaderboard) had a ROC AUC of 0.67667
- This put the solution in a position to qualify for the prize money as 1st of almost 2,500 teams of professional data scientists
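For readers unfamiliar with the metric behind these scores, here is a minimal base R sketch of ROC AUC computed with the rank-sum (Mann-Whitney) identity. The function name `roc_auc` and the toy vectors are illustrative assumptions; this is not code from the project's scripts.

```r
# ROC AUC via the rank-sum identity: the probability that a randomly chosen
# positive case is scored higher than a randomly chosen negative case.
# `roc_auc` is a hypothetical helper, not part of the original scripts.
roc_auc <- function(labels, scores) {
  stopifnot(length(labels) == length(scores), all(labels %in% c(0, 1)))
  n_pos <- as.numeric(sum(labels == 1))  # as.numeric avoids integer overflow on big data
  n_neg <- as.numeric(sum(labels == 0))
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)                      # tied scores receive average ranks
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data, made up purely for illustration
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
roc_auc(labels, scores)
```

A score of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which gives some context for scores in the 0.67 range.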
- Given the performance limitations of the R Server (discussed in greater detail in the next section), the training and test pre-processing are done in separate scripts. The methodology and process are almost identical in each
- In the second section of each cleaning script, the categorical features are converted to numeric values ordered by the percentage of malware in each group. Each of these sections was commented out and should have been written as a reusable function, so please uncomment them before running the cleaning script
- Many of the features in the dataset are inherently unintuitive to someone without a working knowledge of computer hardware specifications. For a more complete data dictionary, please consult the one I created and included in the repo, as the Kaggle dictionary does not explain several key features
- This script was written in the spring of my junior year of undergrad, and given the many performance limitations, efficiency of server computing was prioritized over scripting efficiency. While I hope to revisit and revise the scripts in the future, there are many areas for improvement in terms of the code's elegance
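The categorical-to-numeric conversion described above could be wrapped in a reusable function along the following lines. This is a sketch under stated assumptions, not the code in the cleaning scripts: the helper names `fit_rate_encoding` and `apply_rate_encoding` are hypothetical, and the toy category values and labels are made up.

```r
# Learn a mapping from each category to its position when categories are
# sorted by malware rate (1 = lowest rate). Hypothetical helper.
fit_rate_encoding <- function(x, target) {
  rates <- tapply(target, as.character(x), mean)  # malware rate per category
  ord <- order(rates)                             # lowest rate first
  setNames(seq_along(ord), names(rates)[ord])
}

# Apply a learned mapping; categories not seen when fitting (and NA) become NA.
apply_rate_encoding <- function(x, mapping) {
  unname(mapping[as.character(x)])
}

# Toy example with made-up values
feature <- c("Off", "Off", "On", "On", "Warn", "Warn")
has_malware <- c(1, 1, 1, 0, 0, 0)
mapping <- fit_rate_encoding(feature, has_malware)
apply_rate_encoding(c("Warn", "On", "Off", "Unseen"), mapping)
```

Fitting the mapping on the training data and reusing it on the test data keeps the two cleaning scripts consistent, since the test set has no labels from which rates could be computed.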
- The data for this project is 5 GB in total, so large that attempting to open any of the individual files will crash either Excel or the RStudio environment
- Even after the data has been read into RStudio, modeling data of this size requires a massive amount of memory (my initial attempts at neural network development for this problem hit the server's 55 GB maximum), so a distributed system is needed to process it
- Since this solution was conceptualized in R, an R Server was used to model the data. The attached R Server Best Practices PDF compiles the insights gathered throughout this project and can provide a helpful launching point for other R developers
- If an R Server instance is not available, the included desktop_subset CSV can be helpful for those interested in tackling the data complexity/cleaning aspects of this problem
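How the desktop_subset CSV was produced is not described here. For anyone who wants to carve out a subset of a different size on a desktop machine, one possible approach is to stream the file in chunks and keep a random fraction of rows, so the full file never sits in memory. The function name, arguments, and defaults below are assumptions for illustration, and the sketch assumes no field contains an embedded newline.

```r
# Stream a large CSV line by line in chunks and write a random sample of rows
# to a smaller file. Hypothetical helper; assumes one record per line.
sample_csv_lines <- function(path, out_path, keep_frac = 0.01, chunk_lines = 100000L) {
  con_in <- file(path, open = "r")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_path, open = "w")
  on.exit(close(con_out), add = TRUE)

  writeLines(readLines(con_in, n = 1L), con_out)  # copy the header row
  repeat {
    lines <- readLines(con_in, n = chunk_lines)   # next chunk of raw lines
    if (length(lines) == 0L) break                # end of file reached
    keep <- runif(length(lines)) < keep_frac      # keep each row with prob. keep_frac
    writeLines(lines[keep], con_out)
  }
  invisible(out_path)
}
```

A call such as `sample_csv_lines("train.csv", "train_subset.csv", keep_frac = 0.01)` (file names assumed) would produce a file small enough to load with `read.csv`.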
- There was a direct relationship between the number of anti-virus (AV) products installed and the percentage of users with malware on their devices. The presumed explanation is that users only need 1 AV product, so the more products a user has installed, the less likely they are to understand how such products should be used
- Almost 80% of individuals whose device contained a touch screen that had not been configured (ExistsNotSet) had malware. This was similarly inferred to be because the user lacked the technological acumen to understand the full capabilities of their device
- Gamers (denoted by the Wdft_IsGamer boolean feature) had a statistically significantly lower infection rate than other users. This is likely due to the technology knowledge required to be a consistent gamer, as well as the higher cost of gaming technology
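Comparisons like these can be reproduced by tabulating the infection rate per feature level and then testing the difference in proportions. The sketch below uses base R's `prop.test`; the helper `malware_rate_by`, the target column name `HasDetections`, and all counts in the toy data are assumptions made up for illustration and are not results from the project.

```r
# Infection rate per level of a feature. Hypothetical helper; the default
# target column name is an assumption.
malware_rate_by <- function(df, feature, target = "HasDetections") {
  n <- tapply(df[[target]], df[[feature]], length)
  infected <- tapply(df[[target]], df[[feature]], sum)
  data.frame(level = names(n), n = as.vector(n), infected = as.vector(infected),
             rate = as.vector(infected / n), row.names = NULL)
}

# Made-up toy data: 500 non-gamers (275 infected) and 500 gamers (200 infected)
toy <- data.frame(
  Wdft_IsGamer = rep(c(0, 1), each = 500),
  HasDetections = c(rep(c(1, 0), c(275, 225)), rep(c(1, 0), c(200, 300)))
)

rates <- malware_rate_by(toy, "Wdft_IsGamer")
rates

# Two-sample test of equal proportions between the two groups
prop.test(x = rates$infected, n = rates$n)
```

With millions of rows, even tiny differences become statistically significant, so the size of the rate difference is usually more informative than the p-value alone.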
- Download the data from Kaggle
- Read the data into RStudio using an R Server environment, consulting the best practices PDF if questions arise
- Follow the steps listed in `training_clean.R`, outputting the cleaned data as a separate data file
- Conduct the same steps by following the `test_clean.R` script
- Follow the steps listed in `modeling.R`
- Submit predictions to Kaggle (the script is formatted to produce the output in the style needed for Kaggle's scoring algorithm, but a sample submission is also available on the competition page for reference)
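As a rough illustration of the last two steps, the sketch below fits a logistic regression with base R's `glm`, predicts probabilities for a test set, and writes a two-column CSV. It is not the contents of `modeling.R`: the predictors `x1` and `x2` and the simulated data are made up, and the column names `MachineIdentifier` and `HasDetections` are assumptions that should be checked against the sample submission on the competition page.

```r
set.seed(42)

# Simulated stand-ins for the cleaned training and test data (made up)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$HasDetections <- rbinom(200, 1, plogis(0.8 * train$x1 - 0.5 * train$x2))
test <- data.frame(MachineIdentifier = sprintf("id_%03d", 1:50),
                   x1 = rnorm(50), x2 = rnorm(50))

# Logistic regression; type = "response" returns predicted probabilities
fit <- glm(HasDetections ~ x1 + x2, data = train, family = binomial())
submission <- data.frame(
  MachineIdentifier = test$MachineIdentifier,
  HasDetections = predict(fit, newdata = test, type = "response")
)

# One row per test device, without row names
out_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_file, row.names = FALSE, quote = FALSE)
```

Submitting probabilities rather than hard 0/1 labels matters for a ranking-based metric such as ROC AUC, since thresholding discards the ordering information the metric rewards.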