GitHub - sanjaykrishnagouda/Anomaly-Detection: Easy to understand classification problem from a highly skewed kaggle dataset. Solved using logistic regression and SVM, code inspired from top contributor.

sanjaykrishnagouda / Anomaly-Detection Public

Notifications You must be signed in to change notification settings
Fork 0
Star 5

Easy to understand classification problem from a highly skewed kaggle dataset. Solved using logistic regression and SVM, code inspired from top contributor.

www.kaggle.com/dalpozz/creditcardfraud

5 stars 0 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
Logistic Regression complete dataset.png		Logistic Regression complete dataset.png
Logistic Regression.png		Logistic Regression.png
Readme		Readme
SVM kernels run 2.png		SVM kernels run 2.png
SVM kernels.png		SVM kernels.png
cc.py		cc.py
cc_2.py		cc_2.py
complete data svm vs lr r2.png		complete data svm vs lr r2.png
complete data svm vs lr.png		complete data svm vs lr.png
svm on complete data.png		svm on complete data.png
undersampled data svm vs lr.png		undersampled data svm vs lr.png

Repository files navigation

Credit Card Fraud Detection

Anonymized credit card transactions labeled as fraudulent or genuine

Using Logistic Regression and SVM (with RBF kernel) on a highly skewed data set.

Download the data set from here:
https://www.kaggle.com/dalpozz/creditcardfraud

Ideas:
______

-->Dropped the time column from the original dataset and also normalized the amount column. (Done)

-->Can further analyse each column (feature) to see if we can discard other features.

-->Can do PCA to project complete data in a meaningful way. Work in progress

-->Use Neural Networks (with more than 1 hidden layers) for classification.

-->Show a comparitive analysis of using different parameters and types of kernels in SVM function calls.
This may not yield much different results.

Dependencies:
_____________

python 3.5
scikit-learn
pandas
numpy (optional)
tensorflow (comment out, works fine without. Idea is to use tensorflow in future.)
matplotlib

How to run:

call python cc.py (or python 3, however your env for python 3 is configured)
You can call logistic regression/ SVM seperately using cc_2.py by commenting out appropirate lines.
Use function printing_Kfold_cross for Log Reg and
diff_kernels() for SVM with different kernels (all kernels will run)

pass complete dataset (temp[0],temp[1]) or the undersampled one(temp[2],temp[3]) as parameters to the function.
SVM runs for really long time on the complete dataset. Adding the graphs for this run.

Note:
Please ignore the function using_SVM, it needs proper structuring and further analysis.
Including graphs for recall on complete data and undersampled data comparing the performance of SVM and Logistic Regression.

Questions to explore:
Use of L1/L2 loss function needs further analysis. (Used L2 here, maybe bfgs is faster.)
Does SVM perform worse on skewed data?
Is recall a good metric to analyse the performance? It really depends on the data:
If false positives are much less tolerable than false negatives. (You do not have cancer but it predicts that you do.)
(or) the other way around.

Some results :

(C:\ProgramData\Anaconda3) C:\Users\gsk69>m:

(C:\ProgramData\Anaconda3) M:\>cd "Course stuff\Machine Learning\projects\Data Sets\credit card"

(C:\ProgramData\Anaconda3) M:\Course stuff\Machine Learning\projects\Data Sets\credit card>python cc_2.py
USING Logistic Regression::
Best Mean: 0.612260 with inverse regularization strength 10.000000
USING SVM :
Best Mean: 0.684470 with inverse regularization strength 10.0000 using rbf kernel
USING SVM :
Best Mean: 0.754652 with inverse regularization strength 1.0000 using poly kernel
USING SVM :
Best Mean: 0.275240 with inverse regularization strength 1.0000 using sigmoid kernel

(C:\ProgramData\Anaconda3) M:\Course stuff\Machine Learning\projects\Data Sets\credit card>
_______________________________________________________________________________________________________________
_______________________________________________________________________________________________________________
currently running rbf kernel
USING SVM :
Best Mean: 0.607521 with inverse regularization strength 1.2800 using rbf kernel
SVM with rbf took 6244.27 seconds for completion.
currently running linear kernel
USING SVM :
Best Mean: 0.782996 with inverse regularization strength 0.0200 using linear kernel
SVM with linear took 1137.45 seconds for completion.
currently running sigmoid kernel
USING SVM :
Best Mean: 0.274755 with inverse regularization strength 0.1600 using sigmoid kernel
SVM with sigmoid took 1676.50 seconds for completion.
currently running poly kernel
USING SVM :
Best Mean: 0.711231 with inverse regularization strength 0.1600 using poly kernel
SVM with poly took 1000.95 seconds for completion.

____________________________________________________________________________________

Run details :
Used BaggingClassifier with default values and base classifier = SVM (SVC/LinearSVC)
All kernels with max_iter set to 1000
Used LinearSVC for linear kernel

currently running rbf kernel
USING SVM :
Best Mean: 0.604330 with inverse regularization strength 1.2800 using rbf kernel
rbf took 3615.35 seconds for completion.
currently running linear kernel
USING SVM :
Best Mean: 0.727653 with inverse regularization strength 1.2800 using linear kernel
linear took 6100.82 seconds for completion.
currently running sigmoid kernel
USING SVM :
Best Mean: 0.274755 with inverse regularization strength 0.3200 using sigmoid kernel
sigmoid took 1713.92 seconds for completion.
currently running poly kernel
USING SVM :
Best Mean: 0.703166 with inverse regularization strength 0.3200 using poly kernel
poly took 1635.25 seconds for completion.

Total run time = 13065.34s = 3h 37m

About

Easy to understand classification problem from a highly skewed kaggle dataset. Solved using logistic regression and SVM, code inspired from top contributor.

www.kaggle.com/dalpozz/creditcardfraud