- pandas
- numpy
- sklearn
- matplotlib
- seaborn
In this project we try to detect credit card fraud using Logistic Regression, after preprocessing the data.
The dataset used is Credit Card Fraud Detection from Kaggle.
We start by loading the data into the Jupyter notebook. After loading the data, we convert it into a DataFrame using pandas to make it easier to handle.
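A minimal sketch of this step, assuming the Kaggle CSV has been saved locally under its default name `creditcard.csv`:

```python
import pandas as pd

# Load the Kaggle CSV into a pandas DataFrame
dataframe = pd.read_csv("creditcard.csv")
```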
After loading the data, we visualize it. First we need to know what our data looks like, so we use `dataframe.head()` to view the first 5 rows. We also need to know how our data is distributed, so we plot it.
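For example (the target column name `Class`, with 0 = legitimate and 1 = fraud, follows the Kaggle dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

print(dataframe.head())  # first 5 rows

# Plot the class distribution to see how skewed the labels are
sns.countplot(x="Class", data=dataframe)
plt.title("Class distribution")
plt.show()
```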
Using `dataframe.corr()`, we compute the Pearson (standard) correlation coefficient matrix.
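A short sketch of rendering that matrix as a seaborn heatmap:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = dataframe.corr()  # Pearson correlation by default
plt.figure(figsize=(12, 9))
sns.heatmap(corr, cmap="coolwarm")
plt.title("Pearson correlation matrix")
plt.show()
```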
Since the data is highly unbalanced, we need to undersample it.
Why are we undersampling instead of oversampling?
We are undersampling the data because it is highly unbalanced. Transactions which are not fraudulent are labeled 0, and transactions which are fraudulent are labeled 1.
There are 284,315 non-fraudulent transactions and only 492 fraudulent ones.
If we oversampled instead, we would have to generate almost 284,000 dummy fraudulent samples; training on a dataset dominated by synthetic rows would heavily skew the outcome, so undersampling is the better approach for getting a reliable result.
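As a minimal sketch of random undersampling with pandas, one way to produce the `X_train`/`y_train` used below (the `Class` column name, the fixed random seeds, and the 80/20 split are assumptions; the `X_test_all`/`y_test_all` used later presumably come from a held-out slice of the full, imbalanced data):

```python
from sklearn.model_selection import train_test_split

# Keep all 492 fraud rows and sample an equal number of legitimate rows
fraud = dataframe[dataframe["Class"] == 1]
legit = dataframe[dataframe["Class"] == 0].sample(n=len(fraud), random_state=42)

balanced = pd.concat([fraud, legit]).sample(frac=1, random_state=42)  # shuffle
X = balanced.drop("Class", axis=1)
y = balanced["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```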
We create a user-defined function for the confusion matrix, or we can use `confusion_matrix` from the `sklearn.metrics` module.
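If we use the library version, a minimal usage sketch (with `y_train` as the true labels and `pred` as the model predictions, as in the training step below):

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_train, pred)
print(cm)
```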
We train our model using `LogisticRegression` from `sklearn.linear_model`.
The syntax is as follows:

```python
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)           # fit on the undersampled training set
pred = classifier.predict(X_train)         # predict on the training set itself
print(classifier.score(X_train, y_train))  # training accuracy
```
We get a training accuracy above 95% most of the time with random samples. The confusion matrix is as follows:
We find the Precision, Recall, F1-score, Mean Absolute Error, Mean Absolute Percentage Error, Mean Squared Error, and R-squared using the following syntax:

```python
import numpy as np
from sklearn.metrics import classification_report, mean_absolute_error, mean_squared_error, r2_score

report = classification_report(y_train, pred)
print(report)

mean_abs_error = mean_absolute_error(y_train, pred)
# True division (/), not floor division (//); note that MAPE is
# ill-defined when y_train contains zeros, as it does here.
mean_abs_percentage_error = np.mean(np.abs((y_train - pred) / y_train))
mse = mean_squared_error(y_train, pred)
r_squared_error = r2_score(y_train, pred)
print("Mean Absolute Error : {}\nMean Absolute Percentage Error : {}\nMean Squared Error : {}\nR Squared Error : {}".format(
    mean_abs_error, mean_abs_percentage_error, mse, r_squared_error))
```
To improve our performance, we use a combination of undersampling and SMOTE on our dataset. SMOTE synthesizes new minority-class samples by interpolating between existing ones, rather than duplicating them. The syntax is as follows:

```python
from imblearn.over_sampling import SMOTE

oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)
```
We apply logistic regression on our dataset as usual. After applying it, we observe that our accuracy improves in most cases. The confusion matrix is as follows:
To improve our accuracy further, we tune the hyperparameters. The syntax is as follows:

```python
# class_weight penalizes errors on each class differently
classifier_b = LogisticRegression(class_weight={0: 0.6, 1: 0.4})
classifier_b.fit(X_train, y_train)
pred_b = classifier_b.predict(X_test_all)         # predict on the full test set
print(classifier_b.score(X_test_all, y_test_all)) # test accuracy
```
The confusion matrix of the testing model is as follows: