Thai Hoc Nguyen*, Xuan Thi Tran* (*equal contribution)
| ICTA2024 | PDF | BibTex | Datasets |
Many companies aim to offer the best products and services to attract attention in the market. Each customer has different preferences due to variations in age, gender, and other personal factors, and purchasing behavior is a significant indicator of those preferences. To serve customers well, companies must find a way to group similar customers into segments. Segmenting customers by their direct or indirect interactions with the company can be challenging, because it is difficult to select the key features that characterize those interactions.
The RFM model, which refers to the three key features Recency, Frequency, and Monetary value, is considered an effective technique for revealing valuable insights into customer behavior. Some studies have shown that combining the K-means algorithm with the RFM model can be a promising solution for customer segmentation.
With the continuous growth of generated data, it is crucial to deploy a machine-learning-based segmentation model in a Big Data system. Hadoop and Spark are among the leading Big Data storage and processing technologies. In this study, we propose an automatic machine-learning-based customer segmentation solution developed with the Spark application framework, with customer data stored in HDFS.
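As a rough illustration of the RFM and K-means idea described above, the following pure-Python sketch computes Recency, Frequency, and Monetary values per customer and clusters them. It is not the Spark pipeline from this repository: the transaction records, the snapshot date, the min-max scaling, and the choice to seed the centroids with the first k points are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
transactions = [
    ("c1", date(2024, 1, 5), 120.0),
    ("c1", date(2024, 3, 1), 80.0),
    ("c2", date(2023, 11, 20), 15.0),
    ("c3", date(2024, 2, 27), 300.0),
    ("c3", date(2024, 2, 28), 250.0),
    ("c3", date(2024, 3, 2), 410.0),
    ("c4", date(2023, 12, 1), 22.0),
]


def rfm_table(rows, snapshot):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_seen = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in rows:
        last_seen[customer] = max(day, last_seen.get(customer, day))
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        c: ((snapshot - last_seen[c]).days, frequency[c], monetary[c])
        for c in last_seen
    }


def min_max_scale(points):
    """Rescale each feature to [0, 1] so no single feature dominates."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [
        tuple((v - lo) / span for v, lo, span in zip(p, lows, spans))
        for p in points
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels


rfm = rfm_table(transactions, snapshot=date(2024, 3, 3))
customers = sorted(rfm)
labels = kmeans(min_max_scale([rfm[c] for c in customers]), k=2)
segments = dict(zip(customers, labels))
print(segments)
```

With this toy data, the recent, frequent, high-spending customers (c1, c3) end up in one segment and the lapsed, low-spending customers (c2, c4) in the other.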
First, you need to install Hadoop and Spark tools. Follow the installation instructions below:
Create a virtual environment so that libraries from different applications do not conflict. You can create it anywhere you like. Use python on Windows or python3 on Linux.
$ python3 -m venv demo-project
$ cd demo-project
$ source bin/activate
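To confirm that the environment is actually active before installing anything, a small check such as the one below can help. It relies on the standard-library behavior that sys.prefix differs from sys.base_prefix inside an environment created by venv; the helper name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv environment."""
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```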
Clone the repository from GitHub to your local machine:
$ git clone https://github.com/nthaihoc/segmentation-customer-hadoop-spark-mlops-icta-2024.git
Install the libraries needed to manage and run the application. Use pip on Windows or pip3 on Linux.
$ cd segmentation-customer-hadoop-spark-mlops-icta-2024
$ pip3 install -r requirements.txt
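If the installation seems incomplete, a quick check like the following sketch can list requirement names that are not installed in the current environment. It assumes a simple requirements.txt with one package per line, only checks that a package is present (not its version), and the helper name missing_requirements is hypothetical.

```python
import re
from importlib import metadata
from pathlib import Path


def missing_requirements(lines):
    """Return requirement names from `lines` that are not installed."""
    missing = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r / -e
        # Keep only the distribution name, dropping version specifiers and extras.
        name = re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


req_file = Path("requirements.txt")
if req_file.exists():
    print("missing packages:", missing_requirements(req_file.read_text().splitlines()))
```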
The important files and directories are artifacts, src, and dvc.yaml.
artifact: contains the model and result files
src: contains the application source code
dvc.yaml: a configuration file that supports automatic command-line execution for building and managing pipelines
See more information about DVC
After completing all the steps above, run the following command to test the application.
$ dvc repro
For any feedback or comments, please feel free to contact me through the following information:
@INPROCEEDINGS{XuanThiTran2018,
author={Xuan Thi Tran and Thai Hoc Nguyen},
booktitle={The 3rd International Conference on Advances in Information and Communication Technology (ICTA2024)},
title={An Automatic Machine Learning based Customer Segmentation Model with RFM Analysis},
year={2024},
volume={},
number={},
pages={},
keywords={Machine Learning, RFM model, K-means Clustering},
doi={}}