The goal of this project is to analyze and predict fraud among healthcare providers in the well-known Kaggle data set linked above.
Classification analysis of this kind can be particularly useful to health insurance companies as well as to public health advocates. With the proper approach, an assessor can not only identify which providers are engaged in fraudulent activities but also avoid erroneously classifying legitimate providers as fraudulent, which, in the case of insurance companies, saves a great deal of money in the process.
To achieve this goal, we dig deeply into the data and apply a variety of machine learning classification techniques, including classic models such as Logistic Regression and Support Vector Classification as well as more modern models such as CatBoost and LightGBM.
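As a rough illustration of how two of the classic models might be compared, here is a minimal sketch. It is not taken from the notebooks: the synthetic, imbalanced data from `make_classification` is only a stand-in for the real provider-level features, and the `class_weight="balanced"` setting and F1 scoring are our assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the provider feature table (assumption):
# roughly 10% of the rows belong to the positive ("fraud") class.
X, y = make_classification(
    n_samples=600,
    n_features=12,
    n_informative=6,
    weights=[0.9, 0.1],
    random_state=0,
)

# Both models are scale-sensitive, so each one gets its own scaler
# inside a pipeline; the scaler is then refit on every training fold.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
    "svc": make_pipeline(
        StandardScaler(),
        SVC(class_weight="balanced"),
    ),
}

# 5-fold cross-validated F1, which accounts for both missed fraud
# and legitimate providers wrongly flagged as fraudulent.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean F1 = {score:.3f}")
```

Keeping the scaler inside the pipeline means it never sees the held-out fold, which avoids leaking information into the validation scores.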
Here you will find three notebooks of particular note: I. Data Exploration, II. Data Preparation, and III. Machine Learning Processing. Given the length of the project, we found it most expedient to split these three stages into separate notebooks for easier viewing.
Over the course of the project, we used a variety of standard tools, including Pandas, NumPy, Seaborn, and Matplotlib. Of further note are scikit-learn's StandardScaler, PCA, LogisticRegression, KNeighborsClassifier, LinearDiscriminantAnalysis, GaussianNB, and GridSearchCV. We also used SVM, CatBoostClassifier, and LGBMClassifier. The project culminated in the successful implementation of stacking techniques.
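To give a sense of how several of these pieces can fit together, here is a minimal sketch that chains StandardScaler, PCA, and LogisticRegression, tunes the chain with GridSearchCV, and then stacks the result with other classifiers. It is an illustration under stated assumptions rather than the code from our notebooks: the data is synthetic, the parameter grid and the choice of base learners are hypothetical, and the CatBoost and LightGBM models are left out to keep the example dependent on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the prepared provider data (assumption).
X, y = make_classification(
    n_samples=600,
    n_features=12,
    n_informative=6,
    weights=[0.9, 0.1],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale -> reduce dimensionality -> classify, as a single estimator.
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("pca", PCA()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Hypothetical grid: number of principal components and regularization strength.
param_grid = {
    "pca__n_components": [4, 8],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Stack the tuned pipeline with three other base learners; a logistic
# regression meta-learner combines their cross-validated predictions.
stack = StackingClassifier(
    estimators=[
        ("tuned_lr", search.best_estimator_),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("lda", LinearDiscriminantAnalysis()),
        ("gnb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
print(f"Stacked model test accuracy: {test_accuracy:.3f}")
```

On imbalanced data such as this, accuracy alone can be misleading, so in practice it is worth inspecting precision and recall for the fraud class as well.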
For further discussion of the project, its process, and the full analysis, please consult the blog, which is still forthcoming at the time of this writing. Should you have any other questions, please feel free to reach out to either of us.