jasontanx/capstone-project-machine-learning

Capstone-Project-Machine-Learning - Tourism Domain

This final-semester capstone project was carried out as a requirement of the MSc in Data Science and Business Analytics course in which I was enrolled.

Topic: Building a Novel Predictive Model to Predict Tourist Travel Preferences for Effective Planning of Domestic Tour Packages

Presentation Deck: Click Here

| Step | Section Involved | Tools Used | Main Packages Involved |
| --- | --- | --- | --- |
| 1 | Initial Data Exploration | Python - Google Colab | N/A |
| 2 | Exploratory Data Analysis | R Programming | ggplot2 & dplyr |
| 3 | Data Pre-Processing | Python - Google Colab | Numpy, Pandas & Sklearn (LabelEncoder & OneHotEncoder) |
| 4 | Modelling (Clustering) | Python - Google Colab | KModes & Matplotlib |
| 5 | Feature Selection | R Programming | Boruta |
| 6 | Modelling (Classification) | Python - Google Colab | Sklearn (Model Selection, LogisticRegression, DecisionTreeClassifier, MLPClassifier, RandomForestClassifier) |
| 7 | Evaluation | Python - Google Colab | Sklearn (Classification report & Confusion matrix) |
| 8 | Deployment | Python - Google Colab | Streamlit, Pickle & Pyngrok / Ngrok |

Introduction

Why did I select this topic for my final-semester capstone project?

  • The tourism industry makes a crucial contribution to Malaysia's Gross Domestic Product ✈️
  • The industry was badly affected by the Covid-19 pandemic 😷
  • The travel industry is on the path to post-Covid-19 recovery 📈
  • There is huge potential in using Machine Learning to attract tourists (and to benefit the whole industry) 🤖

What is the problem statement of the project?

  • The lack of a predictive model and focus on identifying tourist preferences has led to inefficient planning of domestic tour packages by Malaysian tourist operators

Research Questions / Answers I Am Looking For 🌟

In addressing the issues associated with the design and scheduling of the tour packages, a few questions were developed:

  • How can the collected data be effectively grouped into clusters for classification?
  • Which predictive modelling approaches can accurately predict tourists’ clusters for efficient planning of domestic tour packages?
  • What valid recommendations can be provided to the relevant authorities to enhance the scheduling of tour packages?

Aims & Objectives

The aim of this project is to develop a novel data mining solution to accurately predict tourist travel preferences for the scheduling of domestic tour packages in Malaysia.

With this, the 3 objectives of the project are listed below:

  • To develop a clustering model that effectively groups the collected data into clusters for classification purposes.
  • To develop data mining models using predictive modelling approaches to predict tourist travel clusters for efficient planning of domestic tour packages.
  • To draft relevant and valid recommendations for the relevant authorities.

Methodology

What methodology was used to carry out the project?

  • “CRoss-Industry Standard Process for Data Mining”, or the CRISP-DM methodology
  • It is frequently used for data science projects and is the standard data mining methodology for obtaining useful information from a dataset
  • CRISP-DM involves 6 stages

The data was collected through a questionnaire survey. The full list of questions is available here: All Questions Asked

Project Implementation

Data Understanding

  • Initial Data Exploration Repo: Click Here
  • Exploratory Data Analysis (EDA) Repo: Click here
    • Univariate Analysis
    • Bivariate Analysis

Extra Note: What is EDA❓

  1. What questions are we trying to solve/prove?
  2. What kind of data do we have and how do we treat different types?
  3. What's missing from the data and how do we deal with it?
  4. Where are the outliers and why should we care about them?
  5. How can we add, change or remove features to get more out of our data?
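Questions 2 and 3 above can be sketched in a few lines. This is a dependency-free illustration only, not the project's EDA code (which used ggplot2 and dplyr in R). The `responses` records, the column names and the helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical survey records; None marks an unanswered question.
responses = [
    {"age": "18-25", "travel_with": "Friends"},
    {"age": "26-34", "travel_with": None},
    {"age": "18-25", "travel_with": "Family"},
    {"age": None, "travel_with": "Family"},
]


def missing_counts(rows):
    """Count missing (None) answers per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            counts[column] += value is None
    return dict(counts)


def level_counts(rows, column):
    """Univariate view: frequency of each level, ignoring missing answers."""
    return Counter(row[column] for row in rows if row[column] is not None)


def crosstab(rows, col_a, col_b):
    """Bivariate view: co-occurrence counts of two categorical columns."""
    return Counter(
        (row[col_a], row[col_b])
        for row in rows
        if row[col_a] is not None and row[col_b] is not None
    )


missing = missing_counts(responses)
age_levels = level_counts(responses, "age")
age_by_company = crosstab(responses, "age", "travel_with")
```

The frequency counts correspond to the univariate analysis and the cross-tabulation to the bivariate analysis listed under Data Understanding.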

Data Preparation & Clustering (Phase 1)

  • Data Pre-Processing & Clustering Repo: Click Here
    • Level Combination (Combining the levels in categorical variables that had many levels)
    • E.g. The “age” variable initially had a total of 4 categories. However, the last two categories accounted for fewer than 5 observations, so the “35-49 years old” group and the “50 and above” group were combined with the “26 - 34 years old” group.
    • Unsupervised Learning: K-Modes Clustering
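The two steps above can be sketched without any third-party packages. This is a minimal illustration of level combination and of the K-Modes idea (Hamming dissimilarity plus per-attribute modes), not the project's implementation, which used the KModes package. The function names, the sample `records` and their attributes are assumptions made for the example.

```python
import random
from collections import Counter


def merge_levels(values, mapping):
    """Level combination: collapse sparse levels into a broader one."""
    return [mapping.get(v, v) for v in values]


def hamming(a, b):
    """K-Modes dissimilarity: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent level of each attribute across the given records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, k, n_iter=10, seed=0):
    """Alternate between assigning records to the nearest mode and updating modes."""
    rng = random.Random(seed)
    # Start from k distinct records so that no two initial modes coincide.
    modes = rng.sample(sorted(set(records)), k)
    labels = []
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: hamming(r, modes[c])) for r in records
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:  # keep the old mode if a cluster ends up empty
                modes[c] = column_modes(members)
    return labels, modes


# Level combination using the age groups named in the text.
raw_ages = ["26 - 34 years old", "35-49 years old", "50 and above"]
merged = merge_levels(
    raw_ages,
    {"35-49 years old": "26 - 34 years old", "50 and above": "26 - 34 years old"},
)

# Hypothetical categorical records: (trip type, companion, budget).
records = [
    ("Beach", "Friends", "Low"),
    ("Beach", "Friends", "Low"),
    ("Beach", "Family", "Low"),
    ("City", "Family", "High"),
    ("City", "Family", "High"),
    ("City", "Solo", "High"),
]
labels, modes = k_modes(records, k=2)
```

K-Modes replaces the means and Euclidean distance of K-Means with modes and mismatch counts, which is why it suits questionnaire data made up of categorical answers.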

Data Preparation & Modelling (Phase 2)

  • Data Pre-Processing & Modelling Repo: Click Here
    • Feature Selection: Boruta Algorithm (answering the question: which variables do not play a significant role in predicting the dependent variable?)
    • Label encoding and One-hot encoding
    • Logistic Regression / Decision Tree / Artificial Neural Network (ANN) / Random Forest
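The two encodings named above can be illustrated in plain Python. The project used Sklearn's LabelEncoder and OneHotEncoder; the sketch below only mirrors the idea. Which columns were encoded which way is not stated here, so applying one-hot encoding to a predictor and label encoding to the cluster target is an assumption, and the sample values are hypothetical.

```python
def label_encode(values):
    """Map each distinct level to an integer code (levels sorted alphabetically)."""
    classes = sorted(set(values))
    index = {level: code for code, level in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot_encode(values):
    """Expand a categorical column into one 0/1 indicator column per level."""
    classes = sorted(set(values))
    return [[int(v == level) for level in classes] for v in values], classes


companions = ["Family", "Solo", "Friends", "Family"]  # hypothetical predictor
clusters = ["B", "A", "B", "A"]  # hypothetical cluster labels (the target)

X, feature_levels = one_hot_encode(companions)
y, target_levels = label_encode(clusters)
```

One-hot encoding avoids implying an order between nominal levels such as "Family" and "Solo", while integer codes are sufficient for a class label.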

Deployment

  • Model Deployment Repo: Click Here
  • A sample web application can be viewed below
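The Pickle step of such a deployment can be sketched as a save-and-restore round trip. The `model` dictionary and the `predict` helper below are hypothetical stand-ins for the trained classifier, and the Streamlit front end is left out. In the web app, the restored model would be applied to the encoded form inputs.

```python
import pickle

# Hypothetical fitted parameters standing in for the trained classifier.
model = {
    "classes": ["Cluster 0", "Cluster 1"],
    "weights": [[0.8, -0.2], [-0.5, 0.9]],
}


def predict(model, features):
    """Return the class whose weight vector scores the features highest."""
    scores = [
        sum(w * x for w, x in zip(row, features)) for row in model["weights"]
    ]
    return model["classes"][scores.index(max(scores))]


# Training side: serialise the model (pickle.dump writes to a file the same way).
payload = pickle.dumps(model)

# Web-app side: restore the model and score one encoded form submission.
restored = pickle.loads(payload)
prediction = predict(restored, [0, 1])
```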

Sample temporary web application UI - Sample 1

git_3_model_deploy

Web application UI prediction (Proof of Concept) - Sample 2

git_4_model_deploy

Conclusion

Aims - Accomplished

  • K-Modes clustering model was successfully developed with the data collected from a questionnaire survey.

Objectives - Accomplished

  • A total of 4 models were developed (LR, DT, ANN and RF)
  • Suggestions for future model iterations and potential collaborations with relevant stakeholders were provided
  • Ultimately, ANN was selected as the final model for deployment because of its higher prediction accuracy and better evaluation metrics compared with the other models.
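For reference, the metrics behind such a comparison can be computed from a confusion matrix as sketched below. The labels are made up for illustration and are not the project's results. The project itself used Sklearn's classification report and confusion matrix, and the helper names here are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def report(matrix, labels):
    """Overall accuracy plus per-class precision and recall."""
    total = sum(map(sum, matrix))
    correct = sum(matrix[i][i] for i in range(len(labels)))
    rows = {}
    for i, label in enumerate(labels):
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])  # row total
        rows[label] = {
            "precision": matrix[i][i] / predicted if predicted else 0.0,
            "recall": matrix[i][i] / actual if actual else 0.0,
        }
    return correct / total, rows


y_true = [0, 0, 1, 1, 1, 0]  # made-up true cluster labels
y_pred = [0, 1, 1, 1, 0, 0]  # made-up model predictions
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
accuracy, per_class = report(matrix, labels=[0, 1])
```

Running the same report on each candidate model's predictions for a held-out test set gives a like-for-like basis for choosing between them.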

git_2_model_eval

Project Overview

git_1
