This project demonstrates how the analytics event data and customer subscription data of a product or service can be combined to identify user behaviours which predict customer churn. That analysis can then be used to develop strategies to improve retention.
The SQL and Python scripts included in this repo are intended to be used on the simulated social network 'SocialNet7' dataset, which can be generated by running the setup of the fight-churn repository for the book Fighting Churn with Data by Carl Gold.
The SQL scripts and Python notebooks of this project are arranged in the order in which they'd be executed when conducting a churn analysis, and should be run in that order. Intermediate outputs are held in the output folder.
Focuses on setting up the dataset of regular customer observations used for churn analysis.
- churn-calculations includes SQL scripts for calculating:
  - activity-event-based churn
  - MRR churn
  - net retention
  - standard account-based churn
- insert-metrics includes a SQL script for inserting aggregated metrics for each kind of analytics event.
- event-quality-assurance contains a notebook and SQL scripts for plotting events over time.
- metric-quality-assurance contains a notebook and SQL scripts for spotting anomalous metric values which might indicate problems with event collection.
- account-tenure contains scripts for calculating account tenure (the length of time for which there is a continuous subscription for a single account) and inserting this into the data warehouse as its own metric.
- identify-active-periods contains SQL scripts for calculating active periods (allowing for a maximum 7-day gap between subscriptions) and inserting these into an `active_period` table. These are used to determine whether or not a metric observation ended in churn.
- create-churn-dataset is where the fun begins! Here, we create a dataset of 'per-month' event metric observations which form the basis of our churn analysis.
Focuses on cohort analysis and on clustering metrics into groups of behaviours. The business could then use these metrics and groups to target interventions that prevent customers churning from the product.
- metric-summary-stats contains a notebook for checking summary statistics for all metrics (so that we can check the percentage of zero-values).
- metric-scores contains a notebook for producing normalised ("scored") versions of each event metric.
- metric-cohorts contains notebooks for performing cohort analysis on individual and grouped versions of our metrics.
- metric-correlations contains a notebook for calculating and visualising the matrix of Pearson correlation coefficients between metrics.
- group-behavioural-metrics contains notebooks for:
  - grouping metrics together using hierarchical clustering (using SciPy's `linkage` and `fcluster`) and generating a loading matrix for averaging together the scores of those groups
  - applying the loading matrix to create grouped scores.
Focuses on forecasting churn probability with logistic regression.
- train-model contains notebooks for:
  - training the logistic regression model.
  - showing the impact of each model parameter (grouped or individual behaviour) on the likelihood of retention/churn, and checking model calibration.
- forecast-churn contains a notebook and SQL script for creating a revised version of the metric dataset which contains current customers only, ready for forecasting.
The subscription data, analytics data and the churn metrics produced from them are stored locally in a PostgreSQL database.
The simulated product is a simple social network. Eight kinds of analytics events are measured, and the names are fairly self-explanatory:
| Event | Description |
|---|---|
| AdView | sees a single instance of an ad |
| Dislike | dislikes a piece of content |
| Like | likes a piece of content |
| Message | sends a message |
| New friend | makes a new friend |
| Post | posts a piece of content |
| Reply | replies to a message |
| Unfriend | removes a friend connection |
The simulation is designed so that there are realistic relationships between the occurrence of these events and the customers’ simulated churn and renewal. These events, along with customer subscription information, are stored locally in a Postgres data warehouse.
The simulation consists of 15,000 customers over a period of 6 months from 01/01/2020 onwards. The total count for each event type over time follows a similar pattern:
We insert into the data warehouse a table of measurements. For each customer, for each event type, we count the number of times that event occurred in the last 28 days. We make these measurements every 7 days. The metric table looks like this:
where metric names are:
By checking how the maximum, average, minimum and total number of events of each type varies over time, we can be confident that there aren't any gaps in our events:
We also add a metric for account tenure, defined as the length of time a customer has been using the product in their current, uninterrupted period of activity, allowing for gaps of up to 31 days.
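As a rough sketch of the measurement scheme described above (the project itself does this in SQL), the 28-day counts taken every 7 days might be computed in pandas like this, assuming an events table with `account_id`, `event_type` and `event_time` columns (the file path here is hypothetical):

```python
import pandas as pd

# Assumed shape of the raw events table: one row per analytics event.
events = pd.read_csv('output/events.csv', parse_dates=['event_time'])  # hypothetical path

# Measurement dates every 7 days across the simulation period.
observation_dates = pd.date_range('2020-02-01', '2020-07-01', freq='7D')

rows = []
for obs_date in observation_dates:
    # Count each account's events of each type over the trailing 28 days.
    window = events[(events['event_time'] > obs_date - pd.Timedelta(days=28))
                    & (events['event_time'] <= obs_date)]
    counts = (window.groupby(['account_id', 'event_type'])
                    .size()
                    .rename('metric_value')
                    .reset_index())
    counts['observation_date'] = obs_date
    rows.append(counts)

metrics = pd.concat(rows, ignore_index=True)
```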
The churn dataset is a set of customer observations which tell us the number of each event type which occurred over the 28 days prior to the observation, along with an indicator of whether or not that observation occurred immediately before that customer churned.
We can create this dataset by working out which customer active periods did or didn't end in churn. An active period is defined as a period of time when an account is continuously subscribed through one or more individual subscriptions. We allow for a maximum gap of 7 days between subscriptions in this model.
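A minimal sketch of that merge step, assuming a `subs` DataFrame of individual subscriptions with `account_id`, `start_date` and `end_date` columns (the project does this in SQL, and the file path here is hypothetical):

```python
import pandas as pd

MAX_GAP = pd.Timedelta(days=7)  # maximum allowed gap between subscriptions

subs = pd.read_csv('output/subscriptions.csv',
                   parse_dates=['start_date', 'end_date'])  # hypothetical path

def active_periods(subs: pd.DataFrame) -> pd.DataFrame:
    """Merge each account's subscriptions into continuous active periods."""
    periods = []
    for account_id, grp in subs.sort_values('start_date').groupby('account_id'):
        start = end = None
        for _, sub in grp.iterrows():
            if start is None:
                start, end = sub['start_date'], sub['end_date']
            elif sub['start_date'] - end <= MAX_GAP:
                end = max(end, sub['end_date'])  # gap is small: extend the period
            else:
                periods.append((account_id, start, end))  # period ended in churn
                start, end = sub['start_date'], sub['end_date']
        periods.append((account_id, start, end))
    return pd.DataFrame(periods, columns=['account_id', 'start_date', 'end_date'])
```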
By knowing which active periods for customers did or did not end in churn, we know which metric observations did or did not end in churn, and can create a churn dataset like this:
If we suppose that people using the product a lot are less likely to churn than people using it a little or not at all, then a cohort analysis of churn based on common behaviours lets us test that hypothesis. If we divide our customers into quantile-based cohorts according to their level of activity for a particular behaviour, and that activity is related to lower churn, we should find that the most active cohort has the lowest churn rate.
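As a sketch, one such cohort analysis might look like this in pandas, assuming a churn dataset with one column per metric and a boolean `is_churn` column (the file path and column names are hypothetical):

```python
import pandas as pd

dataset = pd.read_csv('output/churn_dataset.csv')  # hypothetical path

def churn_by_cohort(data: pd.DataFrame, metric: str, n_cohorts: int = 10) -> pd.Series:
    # Quantile-based cohorts: label 0 is the least active, n_cohorts - 1 the most.
    cohorts = pd.qcut(data[metric], q=n_cohorts, labels=False, duplicates='drop')
    # The mean of the boolean churn flag per cohort is that cohort's churn rate.
    return data.groupby(cohorts)['is_churn'].mean()

churn_by_cohort(dataset, 'like_per_month')  # expect lower churn in active cohorts
```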
We can see that for behaviours which we intuitively associate with engagement with the service, such as liking posts and making new friends, more active cohorts have lower churn rates (on the left we see the raw cohort averages, and on the right, we see standard-deviation-based scores centered on the mean):
Whereas for the negative behaviour of unfriending, more active cohorts have a higher churn rate:
Perhaps most interesting are some of the rate-based or time-based metrics which can be produced from the base metrics:
Based on these relationships and further investigation, the business may decide to create interventions which encourage users to make friends more often, or consider balancing the number of ads that users see.
Although the number of event types in our simulation is small, for real products and services it can be beneficial to group certain event types together into behaviours, in order to prevent information overload and to discover clearer relationships with churn than individual event types might exhibit.
We can discover groups of correlated metrics by performing hierarchical clustering on the metrics using SciPy's `linkage` and `fcluster`. A brief binary search shows that a correlation threshold of `0.5` gives behaviour groups which make intuitive sense (in the context of this simulation). Running the clustering algorithm with this value on an expanded set of metrics gives us these groups:
| Group | Descriptive name | Metrics |
|---|---|---|
| metric_group_1 | Posting and making friends | adview_per_month \| like_per_month \| newfriend_per_month \| post_per_message \| post_per_month |
| metric_group_2 | Unfriending | unfriend_per_month \| unfriend_per_newfriend |
| metric_group_3 | Messaging | message_per_month \| reply_per_month |
| | Dislikes per month | dislike_per_month |
| | Seeing lots of ads | adview_per_post |
| | reply_per_message | reply_per_message |
| | More dislikes than likes | dislike_pcnt |
| | Increase in new friends | newfriend_pcnt_chng |
| | Time since new friend | days_since_newfriend |
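A sketch of that grouping step, assuming `scores` is a DataFrame of scored metrics with one column per metric (the file path and the average-linkage choice are assumptions; the 1 − correlation distance and the 0.5 threshold follow the description above):

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

scores = pd.read_csv('output/metric_scores.csv')  # hypothetical path

corr = scores.corr()          # Pearson correlations between metric columns
dissimilarity = 1.0 - corr    # highly correlated metrics are "close"

# linkage expects a condensed distance matrix; checks=False tolerates the
# small floating-point asymmetries of a real correlation matrix.
Z = linkage(squareform(dissimilarity.values, checks=False), method='average')

# Cut the tree at distance 0.5, i.e. group metrics correlated above ~0.5.
groups = pd.Series(fcluster(Z, t=0.5, criterion='distance'), index=corr.columns)
```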
We can see that the algorithm groups some metrics into intuitive behaviours, such as posting and making friends, unfriending, and using the platform's messaging features. We can see from the cohort analysis of this first group that the highest scoring cohort has 1/4 of the churn rate of the lowest scoring cohort:
Building a simple logistic regression model allows us to:
- better understand how much each metric/behaviour contributes to the probability of a customer churning
- make a forecast of how likely it is that current customers churn before their next billing renewal date

In this project, we use `LogisticRegression` from `sklearn.linear_model`.
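A minimal sketch of the training step, assuming the churn dataset has been exported with one column per metric score, plus `account_id`, `observation_date` and a boolean `is_churn` label (the file path and column names here are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

dataset = pd.read_csv('output/churn_dataset.csv')  # hypothetical path

X = dataset.drop(columns=['account_id', 'observation_date', 'is_churn'])
y = ~dataset['is_churn']  # model retention rather than churn (see below)

model = LogisticRegression()  # fit_intercept=True by default: the offset term
model.fit(X, y)

weights = pd.Series(model.coef_[0], index=X.columns)
offset = model.intercept_[0]
```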
The model is built on the assumption that while increased customer engagement (as measured by the metrics) leads to an increased chance of retention, this relationship is subject to diminishing returns: for customers well above or below the average on a particular metric/behaviour, even large increases or decreases in that behaviour will only lead to small changes in retention probability. To capture this relationship, we use a sigmoid (s-curve) function in the model.
Although engagement is not directly measurable, we assume that it can be estimated from the behavioural metrics that we've produced.
Each behavioural metric score is multiplied by an engagement strength (weight/coefficient) that captures how much the behaviour (or group of behaviours) contributes to engagement. Overall engagement is the sum of the contributions for each behaviour, plus an intercept (offset) term which shifts the sigmoidal curve such that a user with zero engagement (average user) has a realistic probability forecast for retention and churn. Without this offset term, a customer in our model with zero engagement would have a 50% churn probability, which is unrealistic.
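Putting that together, the model forecasts retention as a sigmoid of the offset plus the weighted sum of the metric scores:

$$P(\text{retain}) = \frac{1}{1 + e^{-\left(b + \sum_i w_i x_i\right)}}$$

where $x_i$ are the behavioural metric (or group) scores, $w_i$ are the engagement strengths, and $b$ is the offset. With all scores at zero, this reduces to $1/(1 + e^{-b})$, the average customer's retention probability.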
We set up the model to predict retention because this is easier to interpret: a positive number to represent something good is more intuitive than a negative number.
The result of training the model is a set of weights which capture how much each metric (or group) contributes to engagement, and a set of retention impacts which represent how much difference in the retention probability it would make to be one standard deviation above average for this metric (or group), assuming a customer was average in all other aspects.
| group_metric_offset | weight | retain_impact | group_metrics |
|---|---|---|---|
| metric_group_1 | 0.610597898 | 0.01275172 | adview_per_month \| like_per_month \| newfriend_per_month \| post_per_message \| post_per_month |
| metric_group_2 | -0.10435934 | -0.003020317 | unfriend_per_month \| unfriend_per_newfriend |
| metric_group_3 | 0.677811622 | 0.013750674 | message_per_month \| reply_per_month |
| dislike_per_month | -0.112771914 | -0.003276928 | dislike_per_month |
| adview_per_post | -0.318860417 | -0.010235154 | adview_per_post |
| reply_per_message | -0.026286454 | -0.00073306 | reply_per_message |
| dislike_pcnt | -0.00200644 | -5.53E-05 | dislike_pcnt |
| newfriend_pcnt_chng | 0.175516025 | 0.004454125 | newfriend_pcnt_chng |
| days_since_newfriend | -0.147217027 | -0.004348952 | days_since_newfriend |
| offset | 3.534485894 | 0.97165323 | (baseline) |
The results for the model offset are included in the table. The `weight` shown for the offset is not a weight, but simply the amount of the offset. The `retain_impact` of the offset is the retention probability forecast for a perfectly average customer (i.e. one with zero scores in all metrics).
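These numbers are consistent with the sigmoid form above; for example, the `retain_impact` of metric_group_1 can be recovered from its weight and the offset:

```python
import math

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

offset = 3.534485894  # model offset from the table above
w1 = 0.610597898      # weight for metric_group_1

baseline = sigmoid(offset)                # ~0.9717: the offset's retain_impact
impact = sigmoid(offset + w1) - baseline  # ~0.0128: metric_group_1's retain_impact
```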
The relative impacts of each behaviour or group are more easily understood when represented like this:
If the retention impact for a metric is 2%, a customer who is one standard deviation above average on that metric and average in all the other metrics has a forecast retention probability 2% higher than the average retention probability.
These results would suggest that being above average in messaging activity, posting and continuing to make friends have the biggest positive impact on retention, while being above average for seeing lots of ads, going long periods between making new friends and disliking content have the biggest negative impact on retention.
Interestingly, the ratio of likes to dislikes has a negligible impact on churn/retention.
It's also important to note that:
- Being below average will have an approximately equal and opposite effect.
- If a customer is multiple standard deviations above average, there are diminishing returns, meaning that each additional standard deviation above average has less impact on the churn or retention probability.
- The same diminishing returns apply to being above average in multiple respects: the combined churn probability reduction will be lower than the sum of the quoted retention probability impacts.
If we create a metrics dataset for just our current customers who have at least 14 days tenure (in order to account for the fact that most new customers will have low metrics due to the short observation period), we can produce a forecast of how likely each customer is to churn or retain before their next observation date:
| account_id | observation_date | churn_prob | retain_prob |
|---|---|---|---|
| 3 | 10/05/2020 | 0.021592298 | 0.978408 |
| 4 | 10/05/2020 | 0.009901682 | 0.990098 |
| 5 | 10/05/2020 | 0.03178912 | 0.968211 |
| 6 | 10/05/2020 | 0.023116958 | 0.976883 |
| 7 | 10/05/2020 | 0.005575805 | 0.994424 |
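A sketch of the forecasting step that produces a table like the one above, assuming `current` holds the current-customer metric scores and `model` is the fitted LogisticRegression from earlier (the file path is hypothetical):

```python
import pandas as pd

current = pd.read_csv('output/current_customers.csv')  # hypothetical path

feature_cols = [c for c in current.columns
                if c not in ('account_id', 'observation_date')]

# predict_proba returns one column per class, ordered as model.classes_.
probs = model.predict_proba(current[feature_cols])
retain_col = list(model.classes_).index(True)  # we trained on a retention label

forecast = current[['account_id', 'observation_date']].copy()
forecast['retain_prob'] = probs[:, retain_col]
forecast['churn_prob'] = 1.0 - forecast['retain_prob']
```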
The overall distribution of churn probabilities looks like this:
There is a tail of accounts with significantly higher churn probability, where it would make sense for the business to focus its efforts with appropriate interventions, but no accounts have anywhere near 100% churn probability. The vast majority have less than 20% churn probability.
We can gain confidence in our model by comparing our predictions for current customers with predictions on historical data (i.e. checking that the model is calibrated) and with the actual historical data. To do this, we compare the average churn rates:
| measurement of churn | average |
|---|---|
| current forecasts | 0.044886 |
| historical forecasts | 0.047886 |
| historical average | 0.047846 |
We see that the mean predicted churn probability is very close to, although slightly lower than, those of the historical predictions, and close to the actual mean of the dataset. What might be responsible for this difference?
When comparing the ratio of the average metric values for the current and historical datasets, we see that the average metric values for current customers are around 5-7% higher than those for the historical dataset in most cases:
| metric | historical mean | current mean | current / historical |
|---|---|---|---|
| like_per_month | 102.701945 | 110.268551 | 1.073675 |
| newfriend_per_month | 7.079585 | 7.599192 | 1.073395 |
| post_per_month | 42.991330 | 46.135846 | 1.073143 |
| adview_per_month | 41.563855 | 44.131359 | 1.061773 |
| dislike_per_month | 15.801018 | 16.870076 | 1.067658 |
| unfriend_per_month | 0.307689 | 0.313352 | 1.018404 |
| message_per_month | 62.230044 | 66.495200 | 1.068539 |
| reply_per_month | 23.687189 | 25.177120 | 1.062900 |
| adview_per_post | 1.618636 | 1.637879 | 1.011889 |
| reply_per_message | 0.378669 | 0.398294 | 1.051825 |
| post_per_message | 4.225650 | 4.486532 | 1.061738 |
| unfriend_per_newfriend | 0.086465 | 0.085726 | 0.991462 |
| dislike_pcnt | 0.953466 | 0.984926 | 1.032995 |
| newfriend_pcnt_chng | 0.184072 | 0.239295 | 1.300005 |
| days_since_newfriend | 7.193891 | 7.997308 | 1.111680 |
Because most metrics have a positive impact on engagement, higher metrics in the current dataset might explain the lower churn probability forecast.
Some ways in which this project could be expanded are:
- Measure the accuracy of the forecasts using area under the curve (AUC) and lift.
- Create churn cohorts with demographic and firmographic categories to see which kinds of customers tend to be more engaged.