-
Notifications
You must be signed in to change notification settings - Fork 0
Easy to understand classification problem from a highly skewed kaggle dataset. Solved using logistic regression and SVM, code inspired from top contributor.
sanjaykrishnagouda/Anomaly-Detection
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Credit Card Fraud Detection Anonymized credit card transactions labeled as fraudulent or genuine Using Logistic Regression and SVM (with RBF kernel) on a highly skewed data set. Download the data set from here: https://www.kaggle.com/dalpozz/creditcardfraud Ideas: ______ -->Dropped the time column from the original dataset and also normalized the amount column. (Done) -->Can further analyse each column (feature) to see if we can discard other features. -->Can do PCA to project complete data in a meaningful way. Work in progress -->Use Neural Networks (with more than 1 hidden layers) for classification. -->Show a comparitive analysis of using different parameters and types of kernels in SVM function calls. This may not yield much different results. Dependencies: _____________ python 3.5 scikit-learn pandas numpy (optional) tensorflow (comment out, works fine without. Idea is to use tensorflow in future.) matplotlib How to run: call python cc.py (or python 3, however your env for python 3 is configured) You can call logistic regression/ SVM seperately using cc_2.py by commenting out appropirate lines. Use function printing_Kfold_cross for Log Reg and diff_kernels() for SVM with different kernels (all kernels will run) pass complete dataset (temp[0],temp[1]) or the undersampled one(temp[2],temp[3]) as parameters to the function. SVM runs for really long time on the complete dataset. Adding the graphs for this run. Note: Please ignore the function using_SVM, it needs proper structuring and further analysis. Including graphs for recall on complete data and undersampled data comparing the performance of SVM and Logistic Regression. Questions to explore: Use of L1/L2 loss function needs further analysis. (Used L2 here, maybe bfgs is faster.) Does SVM perform worse on skewed data? Is recall a good metric to analyse the performance? It really depends on the data: If false positives are much less tolerable than false negatives. (You do not have cancer but it predicts that you do.) (or) the other way around. Some results : (C:\ProgramData\Anaconda3) C:\Users\gsk69>m: (C:\ProgramData\Anaconda3) M:\>cd "Course stuff\Machine Learning\projects\Data Sets\credit card" (C:\ProgramData\Anaconda3) M:\Course stuff\Machine Learning\projects\Data Sets\credit card>python cc_2.py USING Logistic Regression:: Best Mean: 0.612260 with inverse regularization strength 10.000000 USING SVM : Best Mean: 0.684470 with inverse regularization strength 10.0000 using rbf kernel USING SVM : Best Mean: 0.754652 with inverse regularization strength 1.0000 using poly kernel USING SVM : Best Mean: 0.275240 with inverse regularization strength 1.0000 using sigmoid kernel (C:\ProgramData\Anaconda3) M:\Course stuff\Machine Learning\projects\Data Sets\credit card> _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ currently running rbf kernel USING SVM : Best Mean: 0.607521 with inverse regularization strength 1.2800 using rbf kernel SVM with rbf took 6244.27 seconds for completion. currently running linear kernel USING SVM : Best Mean: 0.782996 with inverse regularization strength 0.0200 using linear kernel SVM with linear took 1137.45 seconds for completion. currently running sigmoid kernel USING SVM : Best Mean: 0.274755 with inverse regularization strength 0.1600 using sigmoid kernel SVM with sigmoid took 1676.50 seconds for completion. currently running poly kernel USING SVM : Best Mean: 0.711231 with inverse regularization strength 0.1600 using poly kernel SVM with poly took 1000.95 seconds for completion. ____________________________________________________________________________________ Run details : Used BaggingClassifier with default values and base classifier = SVM (SVC/LinearSVC) All kernels with max_iter set to 1000 Used LinearSVC for linear kernel currently running rbf kernel USING SVM : Best Mean: 0.604330 with inverse regularization strength 1.2800 using rbf kernel rbf took 3615.35 seconds for completion. currently running linear kernel USING SVM : Best Mean: 0.727653 with inverse regularization strength 1.2800 using linear kernel linear took 6100.82 seconds for completion. currently running sigmoid kernel USING SVM : Best Mean: 0.274755 with inverse regularization strength 0.3200 using sigmoid kernel sigmoid took 1713.92 seconds for completion. currently running poly kernel USING SVM : Best Mean: 0.703166 with inverse regularization strength 0.3200 using poly kernel poly took 1635.25 seconds for completion. Total run time = 13065.34s = 3h 37m
About
Easy to understand classification problem from a highly skewed kaggle dataset. Solved using logistic regression and SVM, code inspired from top contributor.
Topics
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published