This is the final project for a Big Data course (from the Master 2 SISE program at the Université Lumière Lyon 2) headed by Guillaume METZLER. The aim of this project was to detect and predict fraud given certain features and using machine learning algorithms.
We worked with over 11 million real transactions from the Fichier National des Chèques Irréguliers (FNCI) and the Banque de France.
The original project can be found here in French.
Fraud detection is a hard machine learning problem because the classes are heavily imbalanced: fraudulent transactions are far rarer than legitimate ones. We aim to build effective predictive models with algorithms suited to this setting. We investigate resampling techniques such as SMOTEENN and Tomek links before running several machine learning algorithms on the data.
- Resampling techniques: SMOTEENN and Tomek links, used to rebalance the classes and improve the representation of frauds.
- Data analysis: several machine learning algorithms, including decision trees, random forests, balanced random forests, XGBoost, ensemble models, basic artificial neural networks, an autoencoder, k-Means, and logistic regression, to detect and predict fraud from the available features.
- Model evaluation: the F1-score, which is more informative than accuracy when classes are imbalanced.
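To illustrate the resampling idea, here is a minimal pure-Python sketch of Tomek link under-sampling. It is not the code from this repository: the function names, the toy points, and the assumption that label `0` is the majority (non-fraud) class are all hypothetical. A Tomek link is a pair of points from opposite classes that are each other's nearest neighbour; removing the majority-class member of each link cleans up the class boundary.

```python
import math


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding itself."""
    best, best_dist = -1, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], p)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best, best_dist = j, d
    return best


def tomek_links(points, labels):
    """Return pairs (i, j), i < j, that are mutual nearest neighbours
    with different labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_majority_in_links(points, labels, majority_label=0):
    """Drop the majority-class point of every Tomek link.

    majority_label=0 is an assumption made for this example.
    """
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Toy data: one fraud (label 1) sitting right next to a non-fraud point.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.2, 5.0)]
labels = [0, 0, 0, 1, 0, 0]

new_points, new_labels = remove_majority_in_links(points, labels)
print(tomek_links(points, labels))  # the single link found in the toy data
print(new_labels)                   # labels left after under-sampling
```

This brute-force search is quadratic in the number of points, so it only suits small examples; on millions of transactions a library implementation backed by a nearest-neighbour index would be the realistic choice.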
NB: Only the Tomek links, k-Means, logistic regression, and autoencoder implementations are included in this repository. The other algorithms are available in the original repository.
The best F1-score obtained across the models was about 0.06.
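For readers unfamiliar with the metric, the sketch below shows how the F1-score is computed and why it is stricter than accuracy on imbalanced data. The labels are invented toy values, not project data, and the helper names are hypothetical. F1 is the harmonic mean of precision and recall for the positive (fraud) class.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy example: 100 transactions, only 2 of them fraudulent (label 1).
y_true = [1, 1] + [0] * 98

# A model that never flags fraud looks excellent on accuracy alone.
never_flags = [0] * 100

# A model that flags 10 transactions and catches 1 of the 2 frauds.
flags_some = [1, 0] + [1] * 9 + [0] * 89

for name, y_pred in [("never flags", never_flags), ("flags some", flags_some)]:
    print(name, round(accuracy(y_true, y_pred), 3), round(f1_score(y_true, y_pred), 3))
```

In this toy setting the model that never flags fraud has the higher accuracy but an F1-score of zero, which is why F1 rather than accuracy is the metric reported here.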
Fraud detection under severe class imbalance remains a significant challenge in machine learning. This project highlights the need for more advanced methods to improve model performance in such situations.
Adrien CASTEX, Célia MAURIN, Annabelle NARSAMA