Customer segmentation involves grouping customers into specific marketing groups, for example by gender, interests, buying habits or demographics. The process requires a thought-out strategy: understanding how to manage and group your customers and which data you will use to do this. By differentiating their customer base, businesses can better target individuals to maximize sales, link-sell appropriately and provide more tailored shopping experiences. The aim of this project is to identify the patterns followed by users, group them with other users that share the same characteristics, and discover which feature values identify each group. To be more specific, we will identify user types in credit card holders' data collected over 6 months.
Customer Segmentation
For this project the dataset was obtained from:
The sample dataset summarizes the usage behavior of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioral variables. The variables contained are:
Variable Name | Description |
---|---|
CUST_ID | Identification of credit card holder (categorical) |
BALANCE | Balance amount left in the account to make purchases |
BALANCE_FREQUENCY | How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated) |
PURCHASES | Amount of purchases made from the account |
ONEOFF_PURCHASES | Maximum purchase amount done in one go |
INSTALLMENTS_PURCHASES | Amount of purchases done in installments |
CASH_ADVANCE | Cash in advance given by the user |
PURCHASES_FREQUENCY | How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased) |
ONEOFF_PURCHASES_FREQUENCY | How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased) |
PURCHASES_INSTALLMENTS_FREQUENCY | How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done) |
CASH_ADVANCE_FREQUENCY | How frequently the cash in advance is being paid |
CASH_ADVANCE_TRX | Number of transactions made with "cash in advance" |
PURCHASES_TRX | Number of purchase transactions made |
CREDIT_LIMIT | Credit card limit for the user |
PAYMENTS | Amount of payments done by the user |
MINIMUM_PAYMENTS | Minimum amount of payments made by the user |
PRC_FULL_PAYMENT | Percent of full payment paid by the user |
TENURE | Tenure of credit card service for the user |
Variable description table
The process followed by the project was:
- Exploratory Data Analysis: Distribution and relation between features are studied. Also, missing values and repeated values are handled.
- Log Transformation: A logarithmic transformation is performed in order to deal with the skewness present in most of the dataset variables.
- Cluster Analysis: The correct number of clusters is chosen using the WCSS and the Silhouette score.
- Clustering Implementation: The customer segmentation is performed using the K-Means algorithm.
- Final Observations: The different types of users are described and final conclusions are drawn.
The final type of users found in the dataset were:
- Type 1 (One-off buyer): This is a common user type principally characterized by frequently spending a lot of money in one go.
- Type 2 (Inactive user): They are not frequent users of the bank services.
- Type 3 (Installment buyer): These users don't usually buy in one go; they usually have installment spendings instead.
- Type 4 (Average user): They are analogous to Type 1 users but with a lower balance and a lower credit limit.
- Type 5 (Responsible user): This type of user rarely buys and only a few items, and is also characterized by frequent use of the cash advance services.
- Type 6 (New users): These are low-tenure users; they haven't made many purchases of any kind yet.
- Type 7 (Highly active user): This is the most active kind of user.
A complete description is available in the Final Observations section.
First of all, we need to import the complete dataset and take a look at the data. It is an 18-column dataset with 8,950 different customers and the characteristics that define their behavior. Before proceeding, we need to take care of some aspects of the dataset:
- Missing values: 313 missing values were identified in the 'MINIMUM_PAYMENTS' column and 1 in the 'CREDIT_LIMIT' column. The missing 'MINIMUM_PAYMENTS' values were filled with the median of the column, while the missing value in 'CREDIT_LIMIT' was filled with the mean of the column.
- Duplicates: There were no duplicate observations in the dataset.
- Unnecessary variables: The 'CUST_ID' column is an exclusive identifier for every user, but it doesn't provide any information to the model, so it is removed from the dataset.
# All the libraries needed for the project are loaded
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
np.random.seed(0)
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
# The dataset is imported
data = pd.read_csv('./data/CC GENERAL.csv')
data.head()
 | CUST_ID | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C10001 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 |
1 | C10002 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 |
2 | C10003 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 |
3 | C10004 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | NaN | 0.000000 | 12 |
4 | C10005 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 |
# Showing number of rows (8950) and columns (18)
data.shape
(8950, 18)
# Checking for missing data through columns
temp = data.isnull().sum().to_frame()
temp.columns = ['Missing data points']
temp
 | Missing data points |
---|---|
CUST_ID | 0 |
BALANCE | 0 |
BALANCE_FREQUENCY | 0 |
PURCHASES | 0 |
ONEOFF_PURCHASES | 0 |
INSTALLMENTS_PURCHASES | 0 |
CASH_ADVANCE | 0 |
PURCHASES_FREQUENCY | 0 |
ONEOFF_PURCHASES_FREQUENCY | 0 |
PURCHASES_INSTALLMENTS_FREQUENCY | 0 |
CASH_ADVANCE_FREQUENCY | 0 |
CASH_ADVANCE_TRX | 0 |
PURCHASES_TRX | 0 |
CREDIT_LIMIT | 1 |
PAYMENTS | 0 |
MINIMUM_PAYMENTS | 313 |
PRC_FULL_PAYMENT | 0 |
TENURE | 0 |
# Filling the missing values in 'MINIMUM_PAYMENTS' with the median of the column
data['MINIMUM_PAYMENTS'] = data['MINIMUM_PAYMENTS'].fillna(data['MINIMUM_PAYMENTS'].median())
# Handling NaN value from 'CREDIT_LIMIT'
avg = data['CREDIT_LIMIT'].mean()
data['CREDIT_LIMIT'] = data['CREDIT_LIMIT'].fillna(avg)
# Let's see if we have duplicated entries in the data
data.duplicated().sum()
0
# Dropping 'CUST_ID' column
data.drop('CUST_ID', axis=1, inplace=True)
For this section, we will study the distribution of the variables. Ideally, the variables should follow a Gaussian distribution, and some techniques used in this project assume Gaussian-distributed data. To check this, we will plot a grid of distributions for every variable. The quartiles and the average of the data are also helpful for identifying data behavior.
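Before plotting, a quick numeric check can quantify how far each variable is from a symmetric, Gaussian-like shape. The following is a minimal sketch, assuming the `data` dataframe loaded above; values far from 0 indicate strong skewness.
# Sketch: skewness of every behavior variable (assumes `data` from the cells above)
# A skewness far from 0 (rule of thumb: |skew| > 1) suggests a strongly skewed distribution
skewness = data.skew(numeric_only=True).sort_values(ascending=False).to_frame('Skewness')
print(skewness)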
# Distribution plot showing the distribution of every behavior variable
data.hist(bins=10, figsize=(15, 15), facecolor='#4472C4', alpha=0.6, layout=(6, 3))
plt.suptitle("Distribution Plot", size=16)
plt.show()
# Descriptive statistics of the data
data.describe()
 | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 |
mean | 1564.474828 | 0.877271 | 1003.204834 | 592.437371 | 411.067645 | 978.871112 | 0.490351 | 0.202458 | 0.364437 | 0.135144 | 3.248827 | 14.709832 | 4494.449450 | 1733.143852 | 844.906767 | 0.153715 | 11.517318 |
std | 2081.531879 | 0.236904 | 2136.634782 | 1659.887917 | 904.338115 | 2097.163877 | 0.401371 | 0.298336 | 0.397448 | 0.200121 | 6.824647 | 24.857649 | 3638.612411 | 2895.063757 | 2332.792322 | 0.292499 | 1.338331 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 50.000000 | 0.000000 | 0.019163 | 0.000000 | 6.000000 |
25% | 128.281915 | 0.888889 | 39.635000 | 0.000000 | 0.000000 | 0.000000 | 0.083333 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1600.000000 | 383.276166 | 170.857654 | 0.000000 | 12.000000 |
50% | 873.385231 | 1.000000 | 361.280000 | 38.000000 | 89.000000 | 0.000000 | 0.500000 | 0.083333 | 0.166667 | 0.000000 | 0.000000 | 7.000000 | 3000.000000 | 856.901546 | 312.343947 | 0.000000 | 12.000000 |
75% | 2054.140036 | 1.000000 | 1110.130000 | 577.405000 | 468.637500 | 1113.821139 | 0.916667 | 0.300000 | 0.750000 | 0.222222 | 4.000000 | 17.000000 | 6500.000000 | 1901.134317 | 788.713501 | 0.142857 | 12.000000 |
max | 19043.138560 | 1.000000 | 49039.570000 | 40761.250000 | 22500.000000 | 47137.211760 | 1.000000 | 1.000000 | 1.000000 | 1.500000 | 123.000000 | 358.000000 | 30000.000000 | 50721.483360 | 76406.207520 | 1.000000 | 12.000000 |
For better understanding, consult the variable description table. Some inferences:
- The average balance amount of users is \$1,564, but this value is inflated by the skewness caused by the highest balance amounts. A better indicator is the median, which shows that \$873.38 is a more representative balance amount among the credit card holders.
- The balance frequency is high in most cases (for more than 75% of the credit card holders). This means that, for this particular set, the users are constantly receiving or spending money.
- The amount of purchases is, again, affected by the skewness of the high values. The typical purchase amount made by credit card holders is \$361 (median).
- The typical maximum amount spent in one go is \$38 (median of the one-off purchase amount).
- The typical amount spent in installment purchases is \$89 (median). At least 25% of the users don't make installment purchases.
- At least half of the credit card holders don't take any cash advance (the median CASH_ADVANCE is 0).
- The purchase frequency is similar to the balance frequency but doesn't take received money into account. About 25% of the credit card holders make purchases very rarely, while another 25% make purchases very frequently; the remaining 50% range between those two.
- One-off purchases are infrequent for most users (the median ONEOFF_PURCHASES_FREQUENCY is about 0.08).
- Only about 25% of the credit card holders make installment purchases with a high frequency.
- Cash advances are paid infrequently by almost all credit card holders.
- 75% of the credit card holders have made 17 or fewer purchase transactions.
- The credit limit for the credit card holders usually ranges between \$1,600 and \$6,500 (interquartile range), with a median of \$3,000. The maximum credit limit is \$30,000 and the minimum is \$50.
- The most common range for the amount of payments is between \$383 and \$1,901 (interquartile range).
- The percent of full payment paid by the user is close to 0 in most cases.
- At most 25% of the users have been credit card holders for less than 12 months; at least 75% have been credit card holders for 12 months.
For most of the variables we can see a skew to the right, but variables like PURCHASES_FREQUENCY or PURCHASES_INSTALLMENTS_FREQUENCY show bimodal distributions, so more than one mode can be considered. This is a first indicator of possible user types.
Once we have a first intuition about the data, we can study how the variables relate to each other using the correlation matrix. A correlation matrix is a table showing correlation coefficients between variables; each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.
# Correlation Matrix plot
fig, axes = plt.subplots(figsize=(20,10))
sns.heatmap(data.corr(), annot=True)
For simplicity, we will create a table with the top 10 pairs of columns with the highest correlation. Since the pairs are ranked by absolute correlation, this table includes the strongest positive and negative correlations.
# Top 10 pair of variables with the highest correlation
corr = data.corr()
c1 = corr.abs().unstack()
c1 = c1.sort_values(ascending = False)
c1 = pd.DataFrame(c1)
c1.reset_index(inplace=True)
c1.columns = ['V1', 'V2', 'Abs Corr']
c1 = c1[~(c1['V1']==c1['V2'])]
c1.drop_duplicates(subset='Abs Corr', inplace=True)
c1.head(10)
 | V1 | V2 | Abs Corr |
---|---|---|---|
17 | PURCHASES | ONEOFF_PURCHASES | 0.916845 |
19 | PURCHASES_INSTALLMENTS_FREQUENCY | PURCHASES_FREQUENCY | 0.862934 |
21 | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | 0.799561 |
23 | PURCHASES | PURCHASES_TRX | 0.689561 |
25 | INSTALLMENTS_PURCHASES | PURCHASES | 0.679896 |
27 | CASH_ADVANCE_TRX | CASH_ADVANCE | 0.656498 |
29 | CASH_ADVANCE | CASH_ADVANCE_FREQUENCY | 0.628522 |
31 | PURCHASES_TRX | INSTALLMENTS_PURCHASES | 0.628108 |
33 | PURCHASES | PAYMENTS | 0.603264 |
35 | PURCHASES_TRX | PURCHASES_FREQUENCY | 0.568430 |
From the information displayed above we can state that:
- In general, this is not a highly correlated dataset; for many pairs the correlation values are near 0, where 0 is the value for completely unrelated variables.
- There is no clear tendency among the negatively correlated variables. When two variables are negatively correlated, it means that as one variable increases its value, the other one decreases. The strongest negative correlations in the dataset have a magnitude near 0.30, which is not a clear indicator of a tendency.
- Keeping that weak tendency in mind, the negatively correlated variables are (the signed values can be listed with the sketch after this list):
- Balance - PRC full payment: as the percent of full payment paid increases, the balance amount decreases.
- Purchases frequency - Cash advance frequency: the more frequently purchases are made, the less frequently cash advances are taken.
- Positive correlations are more common in the dataset and also have higher values, showing a clearer tendency or relation between the variables. A positive correlation indicates that when one variable increases, the other one increases too, and when one variable decreases, the other one decreases too.
- The most positively correlated variables are:
- Purchases and one-off purchases: as the total amount of purchases increases, the maximum amount spent in one go is likely to increase as well.
- Purchases installments frequency and purchases frequency: when installment purchases are made more often, the overall purchase frequency increases.
- Cash advance frequency and cash advance transactions: naturally, when there are more cash advance transactions, the cash advance frequency increases.
- Some variables are practically uncorrelated with every other variable, which means their values are independent and not affected by the other variables. These variables are:
- Balance frequency
- Minimum payments
- Tenure
- PRC full payment
- Credit limit (it is only notably related to Balance)
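Since the table above ranks pairs by absolute correlation, the sign is lost. The following is a minimal sketch, assuming the `corr` matrix computed in the cell above, that lists the most negative signed correlations mentioned in the bullet points:
# Sketch: most negative signed correlations (assumes `corr` from the cell above)
import numpy as np
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep one entry per pair
signed = upper.unstack().dropna().sort_values()                    # most negative pairs first
print(signed.head(5))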
After we have a better understanding of how the variables are related, we can proceed to look for patterns in the data and how the customers can be grouped. For that we'll create a component-by-component plot. For displaying purposes we will need to use a subset of the original dataset; the subset columns were defined as "cont_cols" and they were chosen for their continuous distributions. The columns with discrete distributions were left out.
# Continuous (less discrete) variables used for the pairplot
cont_cols=['BALANCE', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE',
'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS']
# Component by component plot
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(data[cont_cols], plot_kws={'alpha':0.3})
g.fig.suptitle("Component by component plot", fontsize=30, y=1.08)
From the component-by-component plot above we cannot distinguish a clear grouping of the data. In most cases the values concentrate in the lower left corner of the plots (the (0, 0) corner); this is caused by the skewed distribution of the data, and for that reason a log transformation is applied to the data. Also, because of the difference in magnitudes, the data is standardized.
# Log transformation: log10(x + 1) is applied to every numeric column to reduce skewness
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_log = pd.DataFrame()
for c in [c for c in data.columns if data[c].dtype in numerics]:
    df_log[c] = np.log10(data[c] + 1)
# Scaling is made because of the difference of magnitudes
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
df = pd.DataFrame(scaler.fit_transform(df_log), columns=df_log.columns)
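As a quick sanity check, here is a minimal sketch (assuming `data` and `df_log` from the cells above) that compares the skewness before and after the log transformation; the transformed values should be much closer to 0.
# Sketch: skewness before and after the log10(x + 1) transformation
skew_check = pd.DataFrame({'Original skew': data.skew(numeric_only=True),
                           'Log-transformed skew': df_log.skew(numeric_only=True)})
print(skew_check.round(2))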
# Component by component plot
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(df_log[cont_cols], plot_kws={'alpha':0.3})
g.fig.suptitle("Component by component plot", fontsize=30, y=1.08)
Now we have a better view of the data, but there is no clear segmentation using only two components, so we will need a higher-dimensional space to group the observations. This could also mean that the data is tightly packed together and it will be harder to group the points. Nevertheless, some linear agglomerations can be identified in the plots: from a general view, there is a central distribution with some line-shaped agglomerations near the axes of the plots. This relates to the diagonal histograms, which in most cases show a bimodal distribution. This means that there are indeed at least two groups of users.
# Complete component by component plot
# sns.set(style="ticks", color_codes=True)
# g = sns.pairplot(df_log, plot_kws={'alpha':0.3})
# g.fig.suptitle("Component by component plot", fontsize=18, y=1.08)
Before performing the actual clustering, we will analyze how many clusters there are, in other words, how many user types are present in our dataset. For this project two different techniques were compared and analyzed in order to identify the right number of user types. Note that there is no single definitive answer for the number of clusters; there are multiple reasonable choices depending on how specific the user segmentation needs to be. We will try to be as specific as is reasonable when selecting the number of user types. The implemented techniques were:
- Within clusters sum of squares (Elbow method): This technique sums, over all points, the squared Euclidean distance between each point and the centroid of the cluster it is assigned to. As the number of clusters increases, the within-clusters sum of squares (WCSS) always decreases, so we could use a very large number of clusters to get a low WCSS, but then the clusters would start to lose their meaning. To determine the optimal number of clusters, we have to select the value of k at the "elbow", i.e. the point after which the distortion/inertia starts decreasing in a linear fashion. (A small numeric check of the WCSS definition is sketched after this list.)
Within sum of squares
- Silhouette score: The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b), where b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of. The silhouette score is a metric used to evaluate the goodness of a clustering technique. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters, and negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
Silhouette score
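To make the WCSS definition concrete, here is a minimal sketch (assuming the scaled dataframe `df` from above and an arbitrarily chosen number of clusters) that recomputes the within-clusters sum of squares by hand, compares it with the `inertia_` attribute used below, and computes the silhouette score for the same clustering:
# Sketch: manual WCSS check and silhouette score (assumes `df` from the cells above)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

km = KMeans(n_clusters=4, random_state=0).fit(df)            # 4 clusters chosen arbitrarily
assigned_centroids = km.cluster_centers_[km.labels_]         # centroid assigned to each point
wcss_manual = ((df.values - assigned_centroids) ** 2).sum()  # sum of squared distances to centroids
print(wcss_manual, km.inertia_)                              # the two values should match closely
print(silhouette_score(df, km.labels_, sample_size=500, random_state=0))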
from sklearn.cluster import KMeans

sc = []
range_values = range(1, 20)
for i in range_values:
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(df)
    sc.append(kmeans.inertia_)

plt.figure(figsize=(16, 6))
plt.plot(range_values, sc, 'bx-')  # plot WCSS against the actual number of clusters
plt.title('Within clusters sum of squares for different clusters', size=16)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
As mentioned before, we need to identify from the "Within clusters sum of squares for different clusters" plot the point after which the distortion/inertia starts decreasing in a linear fashion. From the plot, this point corresponds to 6 or 7 clusters.
from sklearn.metrics import silhouette_score

range_n_clusters = [3, 4, 5, 6, 7, 8, 9]
sil_avgs = []
for n in range_n_clusters:
    clusterer = KMeans(n_clusters=n, random_state=5)
    cluster_labels = clusterer.fit_predict(df)
    silhouette_avg = silhouette_score(df, cluster_labels, sample_size=500)
    sil_avgs.append(silhouette_avg)
    print("For n_clusters =", n,
          "The average silhouette_score is :", silhouette_avg)
For n_clusters = 3 The average silhouette_score is : 0.22123601435797383
For n_clusters = 4 The average silhouette_score is : 0.27113027110343285
For n_clusters = 5 The average silhouette_score is : 0.30316834807013626
For n_clusters = 6 The average silhouette_score is : 0.3127969743900726
For n_clusters = 7 The average silhouette_score is : 0.33527918494889336
For n_clusters = 8 The average silhouette_score is : 0.2968358408748508
For n_clusters = 9 The average silhouette_score is : 0.24560181703354805
plt.figure(figsize = (16,6))
plt.plot(range_n_clusters,sil_avgs, linewidth=5)
plt.title('Average Silhouette score for different clusters', size=16)
plt.xlabel('Number of clusters')
plt.ylabel('Average Silhouette score')
plt.show()
From the average Silhouette score we detect a clear maximum at 7 clusters. Comparing this with the within-clusters sum of squares technique, we see that the two are consistent with each other. We therefore conclude that there are 7 different groups in our dataset.
Once we know the actual number of different types of users, the next step is to create every group, identify which characteristics describe each one best, and find the most common feature values for that kind of user. We will also label every user in the dataset using these characteristics. The main goal is to assign every user to its corresponding group according to the features or characteristics they have.
For grouping the user types we implemented the K-Means algorithm. K-Means is one of the most popular clustering algorithms. K-Means stores k centroids that it uses to define clusters: a point belongs to a cluster if it is closer to that cluster's centroid than to any other centroid.
K-Means finds the best centroids by alternating between:
- Assigning data points to clusters based on the current centroids.
- Choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.
In the clustering problem, we are given a training set of unlabeled points and want to group the data into a few cohesive clusters. A minimal sketch of this alternation is shown below.
K-Means optimization
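The following is a minimal NumPy sketch of the two alternating steps described above. It is only illustrative (the project itself uses scikit-learn's KMeans below); the function name and parameters are placeholders.
# Illustrative sketch of the two alternating K-Means steps (not used by the project itself)
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Step 1: assign every point to the cluster with the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move every centroid to the mean of the points assigned to it
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids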
# Cluster segmentation with 7 different types
n=7
kmeans = KMeans(n)
kmeans.fit(df)
labels = kmeans.labels_
# Creating a dataframe with the centroid values of every cluster (in standardized log space)
cluster_centers = pd.DataFrame(data=kmeans.cluster_centers_, columns=df.columns)
# Saving the labels for every user into the dataframes previously created
data['Group'] = labels
df['Group'] = labels
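Note that the clustering was fit on the standardized, log-transformed data, so `cluster_centers` is expressed in that space. As a complementary view, here is a minimal sketch (assuming `scaler`, `kmeans` and `df_log` from the cells above) that maps the centroids back to the original units; the group-wise means and medians of the original data shown below serve a similar purpose.
# Sketch: centroids mapped back to the original units (assumes scaler, kmeans and df_log from above)
centers_log = scaler.inverse_transform(kmeans.cluster_centers_)  # undo the standardization
centers_orig = pd.DataFrame(10 ** centers_log - 1,               # undo the log10(x + 1) transform
                            columns=df_log.columns)
print(centers_orig.round(2))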
In this section we present the final results obtained:
- Type of user count
- Representation of expected values for user types
- Mean characteristic values of every user type
- Median characteristic values of every user type
# Changing the range of label from 0-6 to 1-7
data['Group'] = data['Group'] + 1
# Exact amount of users in every group
data['Group'].value_counts().to_frame()
Group | Count |
---|---|
5 | 2083 |
3 | 1686 |
1 | 1423 |
4 | 1271 |
7 | 1202 |
2 | 843 |
6 | 442 |
def without_hue(plot, feature):
    # Annotate each bar of the count plot with its percentage of the total
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x, y), size=12)
    plt.show()
fig = plt.figure(figsize=(16,6))
ax = sns.countplot(x='Group', data=data)
plt.title('Percent of user types', size=18)
plt.xticks(size=12)
plt.xlabel('User type', size=12)
plt.yticks(size=12)
plt.ylabel('Count', size=12)
without_hue(ax, data.Group)
# Representation of expected values for different user types
# It represents whether a variable is high or low for that type of user (with standardized values)
plt.figure(figsize = (16,8))
for i in range(len(cluster_centers)):
    plt.plot(df.columns[:-1], cluster_centers.iloc[i].values, label='User type ' + str(i + 1), linewidth=4)
plt.title('Representation of expected values for different user types', size=18)
plt.xlabel("Characteristic")
plt.ylabel("Standardized values")
plt.xticks(rotation='vertical')
plt.grid()
plt.legend()
plt.show()
# Mean characteristic values of every type of user
data.groupby('Group').mean()
Group | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1529.690678 | 0.974031 | 3391.027400 | 2213.178299 | 1177.849100 | 52.629338 | 0.918551 | 0.632956 | 0.683192 | 0.012294 | 0.180604 | 45.051300 | 6912.679501 | 3114.040950 | 707.384935 | 0.304857 | 11.893183 |
1 | 31.245488 | 0.304810 | 245.054828 | 136.077817 | 109.059454 | 138.592503 | 0.256321 | 0.059241 | 0.188058 | 0.018756 | 0.378410 | 3.753262 | 3666.413602 | 719.718088 | 148.584431 | 0.192722 | 11.609727 |
2 | 441.541254 | 0.914571 | 651.090119 | 34.645955 | 617.462295 | 36.102406 | 0.813117 | 0.023233 | 0.775242 | 0.008313 | 0.135231 | 14.708778 | 3026.648903 | 819.316219 | 672.436378 | 0.316140 | 11.729537 |
3 | 1125.534489 | 0.931889 | 722.144335 | 676.845523 | 45.635460 | 128.842254 | 0.353136 | 0.297128 | 0.063773 | 0.028023 | 0.440598 | 7.036192 | 3814.144196 | 1044.346673 | 647.564407 | 0.058811 | 11.830842 |
4 | 2375.374553 | 0.937841 | 19.359237 | 17.047758 | 2.369890 | 2095.462088 | 0.016886 | 0.011856 | 0.004166 | 0.286065 | 6.900144 | 0.239078 | 4271.181571 | 1776.473410 | 1087.550429 | 0.031285 | 11.731157 |
5 | 943.550005 | 0.797043 | 324.318846 | 203.610362 | 120.776357 | 1277.012860 | 0.381191 | 0.115058 | 0.240515 | 0.224835 | 3.556561 | 4.269231 | 2451.842646 | 593.104578 | 327.436489 | 0.139173 | 6.807692 |
6 | 3543.271832 | 0.978669 | 1453.754110 | 827.003453 | 627.017720 | 2841.312024 | 0.700540 | 0.306978 | 0.522380 | 0.358862 | 9.790349 | 23.505824 | 6128.535774 | 3163.358363 | 1716.450546 | 0.037460 | 11.739601 |
# Median characteristic values of every type of user
data.groupby('Group').median()
Group | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 717.764625 | 1.000000 | 2331.400 | 1322.61 | 739.57 | 0.000000 | 1.000000 | 0.666667 | 0.750000 | 0.000000 | 0 | 33 | 6300.0 | 2043.814652 | 239.701062 | 0.083333 | 12 |
1 | 10.954457 | 0.272727 | 123.500 | 0.00 | 30.90 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 3000.0 | 268.128456 | 117.833863 | 0.000000 | 12 |
2 | 102.119870 | 1.000000 | 458.325 | 0.00 | 431.40 | 0.000000 | 0.916667 | 0.000000 | 0.833333 | 0.000000 | 0 | 12 | 2250.0 | 548.933799 | 172.509121 | 0.142857 | 12 |
3 | 867.728471 | 1.000000 | 450.000 | 400.60 | 0.00 | 0.000000 | 0.272727 | 0.250000 | 0.000000 | 0.000000 | 0 | 5 | 2700.0 | 615.697560 | 310.111406 | 0.000000 | 12 |
4 | 1644.496393 | 1.000000 | 0.000 | 0.00 | 0.00 | 1335.303813 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 3000.0 | 870.661235 | 547.238675 | 0.000000 | 12 |
5 | 577.815457 | 0.857143 | 152.305 | 0.00 | 0.00 | 825.410941 | 0.222222 | 0.000000 | 0.000000 | 0.166667 | 2 | 2 | 1500.0 | 250.226150 | 183.658244 | 0.000000 | 7 |
6 | 2790.631898 | 1.000000 | 913.955 | 406.34 | 331.99 | 1897.163378 | 0.750000 | 0.250000 | 0.500000 | 0.333333 | 7 | 14 | 5500.0 | 1864.755154 | 1050.820416 | 0.000000 | 12 |
Finally, we conclude that there are 7 different types of users in the dataset that can be differentiated from each other according to their characteristics. These types of users are:
- Type 1 (One-off buyer): This is a common user type (about 16% of the total) principally characterized by frequently spending a lot of money in one go. They tend to have a medium-high balance with a high credit limit that allows them to have this lifestyle. They usually don't use any cash advance option.
- Type 2 (Inactive user): One of the least common groups (about 10%). They are not frequent users of the bank services; they usually have low balances and low credit limits and use the bank services just a few times, even though they have been clients for a while.
- Type 3 (Installment buyer): This represents a very common active user type (about 19%). These users use the bank services heavily, similarly to Type 1 users but with a lower balance and a lower credit limit. Another important aspect is that they don't usually buy in one go; they usually have installment spendings instead. They usually don't use any cash advance option.
- Type 4 (Average user): This is an average type of user, representing about 14% of the total. They are largely analogous to Type 1 users; the most important differences are that they have a lower balance and a lower credit limit, and hence a lower amount of purchases and a lower purchase frequency. The main difference is that Type 1 users tend to have a high percent of full payment paid, while this type of user has lower values.
- Type 5 (Responsible user): This is the most common user type (about 23% of the total). This type of user rarely buys and only a few items, and is also characterized by using the cash advance service frequently.
- Type 6 (New users): This is the least common user type (about 5% of the total). These are low-tenure users; they haven't made many purchases of any kind yet, and they have low credit limits and low payments. One important fact is that new users tend to use cash advances, while users with a longer tenure don't.
- Type 7 (Highly active user): This user type represents about 13% of the total. This is the most active kind of user: they have the highest balance among all types and they make a lot of transactions, purchases and payments. They are also the most active users of the cash advance services.