This project focuses on detecting fraudulent transactions in the Goods and Services Tax (GST) system using advanced machine learning techniques. Reliable fraud detection in GST is critical for maintaining the integrity of the tax system and preventing revenue losses.
Fraud detection in GST is a significant challenge due to the increasing incidence of fraudulent activities, including fake invoices, input tax credit (ITC) fraud, and underreporting of sales. Traditional methods often fall short in handling large datasets and recognizing nuanced patterns indicative of fraudulent behavior. This project leverages machine learning techniques to enhance fraud detection capabilities.
The dataset consists of approximately 9 lakh (900,000) rows and 21 columns, including the target variable indicating whether a transaction is fraudulent or non-fraudulent. A major challenge is the class imbalance within the dataset, where fraudulent cases constitute a very small percentage of the total transactions. This imbalance complicates the prediction of fraudulent activities, as standard classification algorithms may perform poorly on underrepresented classes.
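As a quick illustration, the imbalance can be quantified directly from the dataframe; the file name and the target column `is_fraud` below are placeholders rather than the actual dataset schema:

```python
import pandas as pd

# Load the GST transaction dataset (file name is illustrative).
df = pd.read_csv("gst_transactions.csv")

# Count fraudulent vs. non-fraudulent transactions to quantify the imbalance;
# "is_fraud" stands in for the actual target column name.
class_counts = df["is_fraud"].value_counts()
print(class_counts)
print("Fraud share of all transactions: {:.2%}".format(class_counts.min() / class_counts.sum()))
```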
- Construct a predictive model capable of accurately detecting fraudulent GST transactions.
- Implement techniques that address the imbalance in the dataset using a custom mathematical approach.
- Evaluate model performance using metrics tailored for imbalanced data.
- Data cleaning and handling of missing values
- Feature engineering
- Encoding of categorical variables
- Oversampling techniques (e.g., SMOTE)
- Undersampling of the majority class
- Cost-sensitive learning
- Algorithms evaluated: Logistic Regression, Decision Trees, Random Forest, XGBoost, LightGBM
- Stratified k-fold cross-validation (see the combined resampling/cross-validation sketch after this list)
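The sketch below shows one plausible way to wire these pieces together: SMOTE applied only inside the training folds of a stratified k-fold split, with a cost-sensitive XGBoost classifier. The variables `X` and `y`, the hyperparameter values, and the choice of SMOTE over undersampling are assumptions for illustration, not the project's exact configuration.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# X, y: engineered feature matrix and fraud label, assumed to be prepared by
# the cleaning, feature engineering, and encoding steps listed above.
# Placing the resampler inside the pipeline ensures SMOTE touches only the
# training folds, never the validation fold of each split.
pipeline = Pipeline(steps=[
    ("resample", SMOTE(random_state=42)),   # swap for imblearn.under_sampling.RandomUnderSampler to undersample instead
    ("model", XGBClassifier(
        scale_pos_weight=50,                # cost-sensitive weighting; tune to the true class ratio
        eval_metric="logloss",
        random_state=42,
    )),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_recall = cross_val_score(pipeline, X, y, cv=cv, scoring="recall")
print("Mean recall across folds:", fold_recall.mean())
```

Keeping resampling inside the pipeline avoids leaking synthetic samples into the validation folds, which would otherwise inflate the reported scores.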
- Precision
- Recall
- F1-score
- ROC-AUC (see the scoring sketch after this list)
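As a sketch of how these metrics might be reported on a held-out split (reusing the hypothetical `pipeline`, `X`, and `y` from the previous sketch):

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Stratified hold-out split so the test set preserves the original fraud ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)              # hard labels for precision/recall/F1
y_prob = pipeline.predict_proba(X_test)[:, 1]  # fraud probabilities for ROC-AUC

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1-score: ", f1_score(y_test, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_test, y_prob))
```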
The project compared various models and techniques. Here are the detailed performance metrics for different approaches:
| Model/Approach | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | ROC-AUC (%) |
|---|---|---|---|---|---|
| Base Models | | | | | |
| XGBoost | 98 | 85 | 94 | 89 | 99 |
| Gaussian Naive Bayes | 96 | 74 | 84 | 79 | 96 |
| Under-Sampling | | | | | |
| XGBoost (Under-Sampling) | 97 | 76 | 100 | 86 | 99 |
| Gaussian Naive Bayes (Under-Sampling) | 96 | 75 | 79 | 77 | 96 |
| Over-Sampling | | | | | |
| XGBoost (Over-Sampling) | 97 | 78 | 99 | 87 | 99 |
| Gaussian Naive Bayes (Over-Sampling) | 96 | 73 | 89 | 80 | 96 |
| SMOTE | | | | | |
| XGBoost (SMOTE) | 98 | 83 | 95 | 89 | 99 |
| Gaussian Naive Bayes (SMOTE) | 92 | 54 | 88 | 67 | 94 |
Key findings include:
- XGBoost consistently outperformed other algorithms across various methodologies.
- The base XGBoost model achieved high accuracy (98%) and ROC-AUC (99%).
- Undersampling techniques significantly improved recall, with XGBoost reaching 100% recall.
- SMOTE improved the balance between precision and recall for XGBoost, maintaining high performance across all metrics.
The study developed a predictive model for fraud detection in GST transactions that effectively addresses the class imbalance issue. XGBoost combined with resampling methods demonstrated considerable promise in accurately identifying fraudulent activities, with under-sampling and SMOTE proving particularly effective at improving recall and F1-score.
- Explore advanced techniques such as deep learning and ensemble methods
- Integrate real-time data and feedback loops for continuous model improvement
- Adapt the model to evolving fraud patterns
- Further optimize the best-performing model (XGBoost) for deployment in real-world scenarios
To provide a comprehensive view of model performance, we utilized PyCaret AutoML for model comparison. The following image illustrates the comparative performance of various models:
Figure 1: Model comparison chart generated by PyCaret AutoML, showing the performance of different algorithms on key metrics.
This visualization helps in quickly identifying the top-performing models and understanding their relative strengths across different evaluation metrics.
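For reference, a minimal sketch of how such a comparison grid can be produced with PyCaret; the dataframe `df` and the target column name are assumptions, and the options shown are only one possible configuration:

```python
from pycaret.classification import setup, compare_models

# Initialise the PyCaret experiment; "is_fraud" is a placeholder target name,
# and fix_imbalance enables PyCaret's built-in resampling of the minority class.
setup(data=df, target="is_fraud", session_id=42, fix_imbalance=True)

# Train the candidate models and rank them by ROC-AUC, producing a comparison
# grid similar to the one shown in Figure 1.
best_model = compare_models(sort="AUC")
```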