The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed or not.
The data set has 20 predictor variables (features), one target variable (y) and around 41K rows.
The dataset was collected from the UCI Repository - "https://archive.ics.uci.edu/ml/datasets/bank+marketing". Its attributes are:
- age (numeric)
- job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
- marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
- education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
- default: has credit in default? (categorical: 'no','yes','unknown')
- housing: has housing loan? (categorical: 'no','yes','unknown')
- loan: has personal loan? (categorical: 'no','yes','unknown')
- contact: contact communication type (categorical: 'cellular','telephone')
- month: last contact month of year (categorical: 'jan', 'feb', 'mar', …, 'nov', 'dec')
- day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
- duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no')
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
- emp.var.rate: employment variation rate - quarterly indicator (numeric)
- cons.price.idx: consumer price index - monthly indicator (numeric)
- cons.conf.idx: consumer confidence index - monthly indicator (numeric)
- euribor3m: euribor 3 month rate - daily indicator (numeric)
- nr.employed: number of employees - quarterly indicator (numeric)
- y: has the client subscribed to a term deposit? (binary: 'yes','no'); this is the target variable
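
Two of the attributes above need care before modeling: `pdays` uses 999 as a placeholder rather than a real day count, and `duration` is only known once a call has ended. The following is a minimal sketch of one way to handle both, using only the Python standard library. The inline sample, the semicolon-delimited layout, the `load_rows` helper and the choice to drop `duration` are assumptions made for illustration, not necessarily what this analysis did.

```python
import csv
import io

# Tiny inline sample, assumed to mimic the semicolon-delimited UCI file.
# Only a handful of the columns are shown, and the values are illustrative.
SAMPLE = '''"age";"job";"duration";"pdays";"y"
56;"housemaid";261;999;"no"
37;"services";226;6;"yes"
'''

PDAYS_NOT_CONTACTED = "999"  # sentinel documented in the attribute list


def load_rows(handle, drop_duration=True):
    """Parse rows, mapping the pdays sentinel to None and y to a bool."""
    rows = []
    for raw in csv.DictReader(handle, delimiter=";"):
        row = dict(raw)
        row["age"] = int(row["age"])
        # 999 means "not previously contacted", so it must not be read as days
        if row["pdays"] == PDAYS_NOT_CONTACTED:
            row["pdays"] = None
        else:
            row["pdays"] = int(row["pdays"])
        row["y"] = row["y"] == "yes"
        if drop_duration:
            # duration is unknown before the call, so it would leak the target
            row.pop("duration", None)
        else:
            row["duration"] = int(row["duration"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
```

Passing `drop_duration=False` keeps the column, which may be useful for benchmarking but is unlikely to give a realistic predictive model.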
In this analysis we perform:
- Exploratory Data Analysis
- Univariate Analysis
- Bivariate Analysis
- Model Fitting and Treating Imbalanced Data
To treat the imbalanced data so that the models are not biased toward the majority class, we have used -
- RANDOM UNDERSAMPLING
- RANDOM OVERSAMPLING
- SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE (SMOTE)
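
The three techniques above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the code used in this analysis: the function names, the toy data and the `smote_like` helper are hypothetical, and the SMOTE-style function only covers the core interpolation idea for numeric feature tuples (it needs at least two minority points). Resampling of this kind is normally applied to the training split only.

```python
import random


def split_by_class(rows, label_key="y"):
    """Return (minority, majority) lists based on a boolean label."""
    pos = [r for r in rows if r[label_key]]
    neg = [r for r in rows if not r[label_key]]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def random_undersample(rows, seed=0):
    """Keep all minority rows and an equally sized random majority subset."""
    rng = random.Random(seed)
    minority, majority = split_by_class(rows)
    return minority + rng.sample(majority, len(minority))


def random_oversample(rows, seed=0):
    """Duplicate random minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority, majority = split_by_class(rows)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


def smote_like(minority_points, n_new, k=2, seed=0):
    """Create synthetic points between a minority point and a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_points)
        # k nearest minority neighbours by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority_points if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy data: 2 positive rows and 8 negative rows
data = [{"x": i, "y": i < 2} for i in range(10)]
under = random_undersample(data)
over = random_oversample(data)
points = smote_like([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_new=4)
```

Undersampling discards majority rows, oversampling repeats minority rows, and the SMOTE-style step generates new minority points instead of exact copies.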
We have implemented the following machine learning models -
- Logistic Regression Model
- Decision Tree
- Random Forest
- Support Vector Machine
For evaluating the models we have used Precision, Recall and Accuracy.
Model | Data Type | Precision | Recall | Accuracy |
---|---|---|---|---|
Logistic Regression Model | Imbalanced Data | 0.68 | 0.42 | 90.88% |
Logistic Regression Model | Undersampled Data | 0.45 | 0.88 | 86.13% |
Logistic Regression Model | Oversampled Data | 0.42 | 0.87 | 85.63% |
Decision Tree | Imbalanced Data | 0.62 | 0.55 | 91.42% |
Decision Tree | SMOTE Data | 0.46 | 0.82 | 87.04% |
Random Forest | Imbalanced Data | 0.26 | 0.75 | 73.88% |
Random Forest | SMOTE Data | 0.31 | 0.69 | 79.67% |
Support Vector Machine | SMOTE Data | 0.96 | 0.86 | 91.16% |
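
To make the three metrics in the table concrete, here is a small sketch that computes them from true and predicted labels. The helper name and the toy labels are invented for illustration and are not results from this analysis. Returning 0.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute precision, recall and accuracy for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


# Toy labels: 10 subscribers among 100 clients
y_true = [True] * 10 + [False] * 90

# A classifier that always answers "no"
always_no = [False] * 100

# A classifier that finds 8 of the 10 subscribers but raises 12 false alarms
some_yes = [True] * 8 + [False] * 2 + [True] * 12 + [False] * 78

baseline = precision_recall_accuracy(y_true, always_no)
model = precision_recall_accuracy(y_true, some_yes)
```

In this toy case the always-"no" classifier reaches 90% accuracy with zero recall, while the second classifier has lower accuracy (86%) but recall of 0.8. This is why precision and recall are reported alongside accuracy for imbalanced data.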
For the given data, we explored visualization, ways to treat the class imbalance, and the best predictive model for term deposit subscription. The visualizations suggest that repeated campaign calls to customers within 20 days of the previous call are associated with a higher subscription rate. In terms of accuracy, the Decision Tree trained on the original imbalanced data scored highest at 91.42%. Among the models trained on rebalanced data, the Support Vector Machine with SMOTE scored highest at 91.16%, with a precision of 0.96 and a recall of 0.86.