Cross-selling, in which a company sells additional products or services to its existing customers, is one of the most successful techniques in modern marketing. In 2000, a European insurance company that offered various insurance services, including life, auto, and boat insurance, to a large customer base faced a cross-selling challenge: sales of its newest service, the ‘Caravan insurance policy’, were disappointing. The marketing department knew that taking advantage of the existing customer base would improve sales of the new policy; the biggest question, however, was ‘whom to target among the company’s thousands of customers’. I attempt to answer this question in the first part of my analysis. However, since numerous efforts and solutions already exist for answering this question, I focus more on the second part of my analysis, which is devising a ‘go to market’ strategy.
As things stand, the company has to approach all 4000 customers with the policy. If it approaches every customer, it has to divide the marketing budget among all of them, which reduces the discount it can offer to each individual customer and leads to a lower conversion rate. The central idea behind target marketing here is that penetration pricing directly influences the conversion rate. The company wants to spend 10% per unit of revenue on cross-selling (marketing plus penetration pricing) and achieve maximum profit by balancing cost against the number of customers targeted. There are two go-to-market strategies that COIL can use. The first is to target a very narrow set of customers with high penetration pricing in order to achieve a very high conversion rate. The second is to market to a wider customer base with lower penetration pricing, relying on the law of large numbers.
The dataset consists of 86 features, covering insurance product usage data and socio-demographic data. The first 43 attributes are socio-demographic, whereas the remaining 43 describe insurance product usage and indicate which of the company’s existing policies (fire, boat, life, etc.) the customers hold. The variable of interest is Number_of_mobile_home_policies, which indicates the observations that have bought caravan insurance. Since this dataset was used for a challenge, I obtained it already divided into training data and test data, so there was no need to split it for my analysis. The training data has 5893 observations, and the test data consists of the remaining 3929 observations. I combined the training and test datasets for my initial data exploration and visualization; for fitting my models, however, I used the given training data and evaluated the performance measures on the given test data. The dataset is not set up as individual customer observations; instead, each row represents a group of customers, i.e., a large sample size. This might have been done to utilize all the observations while keeping the number of rows manageable. Due to the large number of features, it is infeasible to show the data dictionary or a data sample in this document; the data dictionary can be obtained from http://kdd.ics.uci.edu/databases/tic/dictionary.txt and the complete dataset from http://kdd.ics.uci.edu/databases/tic/tic.html. To give an understanding of the features and their data types, I have included a summary and a sample of the dataset in my Jupyter notebook.
- Which existing customers also tend to buy the ‘caravan mobile home insurance’ policy?
- Given the nature of the decisions made on this data, which of the two ‘go to market’ strategies should be recommended in order to maximize profit?
To take advantage of different classification algorithms and improve the performance measures of my classification, I used multiple algorithms, including Logistic Regression, K-NN classification, and Naïve Bayes classification. I also used ensemble methods, including Bagging, Boosting, and Random Forest, to improve on single-tree classifier models. For the first part of my analysis, I used data visualization and association rules to understand the characteristics of ‘caravan mobile home insurance’ buyers. These results allowed me to state the relationship between existing customers and ‘caravan mobile home insurance’ buyers, along with some general characteristics of those buyers. For the later part of my analysis, I used the aforementioned classification models to devise an optimal ‘go to market’ strategy. Moreover, the unbalanced nature of this dataset required sampling techniques to capture the characteristics of the success class (only 5.9% of the observations). One technique used to handle this imbalance was to under-sample the non-success class observations in the training dataset; another was to over-sample the success class observations in the training dataset. I then fitted the six classification techniques above on three separate training data frames: the unbalanced dataset, the under-sampled dataset, and the over-sampled dataset. In effect, I now have performance measures for 18 different models to compare and evaluate. Since it is critical for my analysis to classify success class observations correctly, the most important performance measures to consider are sensitivity and PPV. Hence, I have created situation-based recommendations associated with different sensitivity and PPV tradeoff values.
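The two resampling approaches described above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's code: the helper names `undersample` and `oversample`, the label key `caravan`, and the choice to balance the classes to an exact 1:1 ratio are all assumptions made for the example.

```python
import random

def split_classes(rows, label_key):
    """Return (minority, majority) row lists for a 0/1 label."""
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)

def undersample(rows, label_key, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = split_classes(rows, label_key)
    out = minority + rng.sample(majority, len(minority))
    rng.shuffle(out)
    return out

def oversample(rows, label_key, seed=0):
    """Resample minority-class rows with replacement until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = split_classes(rows, label_key)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    out = majority + minority + extra
    rng.shuffle(out)
    return out

# Toy training set (hypothetical): 6 buyers in 100 rows, close to the 5.9% success rate.
train = [{"id": i, "caravan": 1 if i < 6 else 0} for i in range(100)]

under = undersample(train, "caravan")
over = oversample(train, "caravan")

print(len(under), sum(r["caravan"] for r in under))  # 12 6
print(len(over), sum(r["caravan"] for r in over))    # 188 94
```

Only the training data is resampled in this sketch; the test data is left untouched.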
Additionally, the cost associated with each of my models is more important than the corresponding performance measures, because the costs of False Positives (FP) and False Negatives (FN) in this business case are nowhere close to equal. After consulting one of my connections, a subject matter expert in insurance cross-selling, I learned that the ratio of the cost of an FP to that of an FN is around 1:18. This means that models with low accuracy but low overall cost are preferred over models with high accuracy but high overall cost. With that in mind, I have developed an analysis that compares overall costs for all eighteen models for classification cutoff values ranging from ‘0’ to ‘1’. Using this analysis, I suggest situation-based models to apply according to their costs and the different ‘go to market’ strategies.
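A cost comparison across cutoff values of this kind might look like the sketch below. It assumes a unit cost of 1 per False Positive and 18 per False Negative (the 1:18 ratio above, in arbitrary units); the scores, labels, and function names are hypothetical and chosen only to make the effect visible.

```python
def total_cost(y_true, scores, cutoff, cost_fp=1.0, cost_fn=18.0):
    """Misclassification cost when every score >= cutoff is predicted as a buyer."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < cutoff and y == 1)
    return cost_fp * fp + cost_fn * fn

def best_cutoff(y_true, scores, cutoffs):
    """Return (cutoff, cost) with the lowest total cost; ties go to the lowest cutoff."""
    costs = ((c, total_cost(y_true, scores, c)) for c in cutoffs)
    return min(costs, key=lambda pair: pair[1])

# Hypothetical predicted probabilities for 8 non-buyers (0) and 2 buyers (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.12, 0.15, 0.22, 0.31, 0.35, 0.45, 0.62, 0.27, 0.71]
cutoffs = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

best = best_cutoff(y_true, scores, cutoffs)
print(best)  # (0.2, 5.0)
```

In this toy example, a cutoff of 0.2 classifies only 5 of 10 rows correctly but costs 5 units, whereas a cutoff of 0.7 classifies 9 of 10 correctly and costs 18 units because it misses one buyer.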
For the first part of my analysis, the initial data visualizations indicate that buyers of ‘caravan mobile home insurance’ policies also tend to buy ‘car policies’ and ‘fire policies’. This is a useful insight for cross-selling the caravan policy to existing customers of ‘car policies’ and ‘fire policies’. Other characteristics of ‘caravan mobile home insurance’ buyers generally include ‘lower level education’, ‘Income 30,000’, and ‘Married’ observations. The corresponding data visualizations can be found in the uploaded Jupyter notebook. Additionally, the best rule from my association rules is {Avg_age=3, Social_class_B2=3, Number_of_boat_policies=1} -> {Number_of_mobile_home_policies=1}. This indicates that observations with ‘number of boat policies = 1’ tend to occur together with the variable of interest, ‘Number of mobile home policies’. Note that the confidence of this rule is 1; however, given the unbalanced nature of the dataset, the best support I could obtain was around 0.0012, so the rule covers only a very small share of the observations. This should be kept in mind when using the rule. The output of my association rules can be found in the associated Jupyter notebook.
The six classification models built on the unbalanced data give a very high accuracy because they classify almost all non-success class observations correctly (these are the majority, about 94%); however, the unbalanced nature of the dataset does not allow any of these models to learn the characteristics of the success class observations. The high accuracy of these models is therefore of limited use, as it does not help in classifying success class observations correctly, which is my main objective. The performance measures (sensitivity, specificity, recall, precision, accuracy, and ROC curves) for all six models fitted on the unbalanced training data and evaluated on the unbalanced test data are provided in the Jupyter notebook.
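The reason high accuracy is of limited use here can be shown with a small sketch. The labels below are hypothetical (59 buyers in 1000 rows, mirroring the 5.9% success rate), and the two sets of predictions are invented for illustration rather than taken from the fitted models.

```python
def metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), and PPV (precision) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
    }

# Hypothetical test set: 59 buyers followed by 941 non-buyers.
y_true = [1] * 59 + [0] * 941

# Model A: predicts "non-buyer" for everyone.
pred_a = [0] * 1000
# Model B: flags 200 rows, 40 of which are real buyers.
pred_b = [1] * 40 + [0] * 19 + [1] * 160 + [0] * 781

m_a = metrics(y_true, pred_a)
m_b = metrics(y_true, pred_b)
print(m_a)  # accuracy 0.941, sensitivity 0.0, ppv 0.0
print(m_b)  # accuracy 0.821, sensitivity about 0.68, ppv 0.2
```

Model A has the higher accuracy but finds no buyers at all, while Model B gives up accuracy to recover roughly two thirds of them, which is why sensitivity and PPV are the measures to compare.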
After under-sampling the non-success class observations in the training dataset, I re-ran my six classification models and noticed an overall improvement in the performance measures associated with correctly identifying the success class observations. These results, along with the other performance measures and ROC curves for the models on the under-sampled data, can be found in the Jupyter notebook. Next, I over-sampled the success class observations in the training dataset and refitted my six classification models. The performance measures of these models on the over-sampled data can also be found in the Jupyter notebook. Note that the most significant part of my analysis is to identify the success class observations correctly; hence, the two most important performance measures for me are PPV and sensitivity. The PPV and sensitivity of all my models are compared in a graph in the Jupyter notebook. Since there is no clear winning model in terms of both sensitivity and PPV, I recommend two different strategies based on the selected tradeoff between PPV and sensitivity. These results can be found in my Jupyter notebook. I then calculated the profit associated with each of my models for classification cutoff values ranging from ‘0’ to ‘1’; this analysis can be found in the uploaded notebook. Finally, I calculated the highest profit for each of my 18 models at the optimal cutoff for that model. The corresponding visualization is in the notebook, and it shows that ‘logistic regression’ on the unbalanced dataset is the most profitable of all 18 models at its optimal cutoff value.