An automatic machine learning based customer segmentation model with RFM analysis at ICTA conference 2024

nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024

ICTA 2024: An Automatic Machine Learning based Customer Segmentation Model with RFM Analysis

Thai Hoc Nguyen*, Xuan Thi Tran* (*equal contribution)

| ICTA2024 | PDF | BibTex | Datasets |

 

Introduction

Many companies focus on providing the best products and services to attract attention in the market. Each customer has different preferences due to variations in age, gender, and other personal factors, and purchasing behavior is a significant indicator of those preferences. To serve customers well, companies must find a way to group customers with similar behavior into segments. Segmenting customers by their direct or indirect interactions with the company can be challenging, because it is difficult to select the key features that characterize those interactions.

The RFM model, which is built on the three key features of Recency, Frequency, and Monetary value, is considered an effective technique for exposing valuable insights into customers' behavior. Several studies have shown that combining the K-means algorithm with the RFM model can be a promising solution for customer segmentation.
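As a rough illustration of the idea above, the sketch below computes RFM features from a list of transactions and clusters them with a small K-means loop. It is a minimal pure-Python sketch, not the code from this repository (which runs on Spark): the function names, the sample transactions, the snapshot date, the min-max scaling step, and the choice of the first two points as initial centroids are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, snapshot):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, purchase_date, amount in transactions:
        # Recency is measured from the most recent purchase of each customer.
        if customer_id not in last_purchase or purchase_date > last_purchase[customer_id]:
            last_purchase[customer_id] = purchase_date
        frequency[customer_id] += 1
        monetary[customer_id] += amount
    return {
        cid: ((snapshot - last_purchase[cid]).days, frequency[cid], monetary[cid])
        for cid in last_purchase
    }


def min_max_scale(rows):
    """Rescale every column to [0, 1] so no single feature dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def assign(points, centroids):
    """Label each point with the index of its nearest centroid (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given initial centroids."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(dim) / len(members) for dim in zip(*members))
                           if members else old)
        centroids = updated
    return assign(points, centroids), centroids


# Hypothetical sample data: (customer_id, purchase_date, amount)
transactions = [
    ("A", date(2024, 1, 5), 20.0),
    ("A", date(2024, 3, 1), 35.0),
    ("B", date(2023, 11, 20), 250.0),
    ("C", date(2024, 3, 10), 15.0),
    ("C", date(2024, 3, 12), 10.0),
    ("C", date(2024, 3, 14), 5.0),
]
rfm = rfm_table(transactions, snapshot=date(2024, 3, 15))
customers = list(rfm)
points = min_max_scale([rfm[c] for c in customers])
labels, _ = kmeans(points, centroids=points[:2])
for customer, label in zip(customers, labels):
    print(customer, rfm[customer], "-> segment", label)
```

In this toy data set the two recent, frequent, low-spend customers end up in one segment and the single large, older purchase in the other. Real data would need a more careful choice of the number of clusters and of the initial centroids.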

With the continuous growth of generated data, it is crucial to deploy a machine learning based segmentation model in a Big Data system. Hadoop and Spark are among the leading Big Data storage and processing technologies. In this study, we propose an automatic machine learning based customer segmentation solution developed with the Spark application framework, with customer data stored in HDFS.

 

Environment Setup

Install Hadoop and Spark

First, you need to install Hadoop and Spark tools. Follow the installation instructions below:

Create environment

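Create a virtual environment so that libraries from different applications do not conflict. You can create it anywhere you like. Use python on Windows or python3 on Linux. The commands below are for Linux; on Windows, activate the environment with Scripts\activate instead of source bin/activate.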

$ python3 -m venv demo-project
$ cd demo-project
$ source bin/activate
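To confirm that the environment is really active before installing anything, a quick check along the following lines can help. This is a small sketch using only the Python standard library; the helper name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If the first line prints False, the activation step did not take effect in the current shell.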

Download Source Code

Clone the repository from GitHub to your local machine with the following command:

$ git clone https://github.com/nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024.git

Install Library Dependencies

Install the libraries needed to manage and run the application. Use pip on Windows or pip3 on Linux.

$ cd segmentation-customer-hadoop-spark-mlops-icta-2024
$ pip3 install -r requirements.txt
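After the install finishes, it can be useful to verify that every package listed in requirements.txt is actually present in the active environment. The sketch below does this with the standard library only. The helper names are hypothetical, the parser handles only simple requirement lines (name plus optional version specifier or extras), and the package names in the sample are illustrative rather than taken from this repository's requirements.txt.

```python
import re
from importlib import metadata
from pathlib import Path


def requirement_names(lines):
    """Extract bare package names from simple requirements.txt lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Illustrative lines only; the real file may list different packages.
sample = ["pyspark==3.5.0", "# comment", "", "dvc>=3.0  # pipelines", "-r other.txt"]
print(requirement_names(sample))

requirements = Path("requirements.txt")
if requirements.exists():
    names = requirement_names(requirements.read_text().splitlines())
    print("missing:", missing_packages(names) or "none")
```

Run it from the repository root so that requirements.txt is found; an empty "missing" list suggests the install step completed.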

Folder Structure

The most important items are artifacts, src, and dvc.yaml.

  • artifacts contains the model and result files
  • src contains the application source code
  • dvc.yaml is a configuration file for building and managing pipelines; it supports automatic command-line execution

See more information about DVC

[Figure: folder structure]

Pipeline Start

After completing all of the steps above, run the following command to start the pipeline and test the application.

$ dvc repro

 

Contributing

For any feedback or comments, please feel free to contact me through the following information:

| email_01 | email_02 |

 

Citation

@INPROCEEDINGS{XuanThiTran2018,
  author={Xuan Thi Tran and Thai Hoc Nguyen},
  booktitle={The 3rd International Conference on Advances in Information and Communication Technology (ICTA2024)}, 
  title={An Automatic Machine Learning based Customer Segmentation Model with RFM Analysis}, 
  year={2024},
  volume={},
  number={},
  pages={},
  keywords={Machine Learning, RFM model, K-means Clustering},
  doi={}}
