# 5.Facebook Performance - Clustering.R
############Facebook Performance Metrics ################
# The full dataset was described and analyzed in:
#
# S. Moro, P. Rita and B. Vala. Predicting social media performance metrics and evaluation of the impact on
# brand building: A data mining approach. Journal of Business Research, Elsevier, In press, Available online
# since 28 February 2016.
browseURL("https://archive.ics.uci.edu/ml/datasets/Facebook+metrics")
###### Data Description ###########
# The data relates to posts published during 2014 on the
# Facebook page of a renowned cosmetics brand. This dataset contains 500 of
# the 790 rows and part of the features analyzed by Moro et al. (2016). The
# remaining rows and features were omitted due to confidentiality issues.
#Have our activities been successful? This is a study of user behavior that will
# most likely be shared with management or advertising agencies.
#Categorical Variable descriptions:
# 1. Type
# -the type of post (link, photo, status or video)
# 2. Category
# - Type 1
# - Type 2
# - Type 3
# 3. Post.Month
# - month of post
# 4. Post.Weekday
# - day of week of the post
# 5. Post.Hour
# - hour of day of the post
# 6. Paid
# - binary variable indicating whether the post was paid or unpaid
#Let's look at the current definitions of our continuous/numeric variables,
# which also include some key metrics/KPIs we will analyze.
# 1. Impressions: the number of times your posts were seen, counting repeat
# views of a single post by the same user.
# 2. Page Likes: the number of people who follow your brand on Facebook. They liked
# your page or opted in to have your posts show up in their feed. You
# can think of them as fans or subscribers. Page Likes show your audience size on Facebook.
# Over time, that number should be growing.
# 3. Reach: Reach is an indication of the number of people who see your material on Facebook.
# 4. Engaged Users: The number of people who engaged with your Page. Engagement includes any
# click or story created.
# 5. Page Consumptions: Total number of clicks
# 6. Interactions: The number of interactions (reactions, comments, shares)
# during a selected time range. The interactions are shown on the day the post
# they relate to was published.
###### Load libraries ###########
library(rpart)
library(psych)
library(dplyr)
library(corrplot)
library(ggplot2)
library(cowplot)
library(GGally)
library(VIM)
library(gmodels)
library(cluster)
library(mclust)
library(car)
######### Custom functions #########
#this function will help us inspect cluster solutions more efficiently later on
seg_summ_function <- function(data, groups) {
  aggregate(data, list(groups), function(x) mean(as.numeric(x)))
}
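#A quick sanity check of the helper on a toy data frame (synthetic values,
#not the Facebook data); the helper is repeated so the sketch is self-contained:

```r
seg_summ_function <- function(data, groups) {
  aggregate(data, list(groups), function(x) mean(as.numeric(x)))
}
# Two obvious groups; the per-group means should be recovered exactly
toy <- data.frame(x = c(1, 2, 9, 10), y = c(5, 5, 1, 1))
groups <- c(1, 1, 2, 2)
res <- seg_summ_function(toy, groups)
res  # group 1: x = 1.5, y = 5 ; group 2: x = 9.5, y = 1
```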
######### Import Data ###################
#import csv file
fb_data <- read.csv("~/Documents/GitHub/Machine-Learning-R/Machine-Learning-R/datasets/dataset_Facebook.csv", sep = ';')
#check results of data import
str(fb_data)
########### Check Data #################
#summarize variables in dataframe
summary(fb_data)
#let's look at the post type
table(fb_data$Type)
#let's look at the category
table(fb_data$Category)
#let's make a cross-table of type and category
table(fb_data$Category, fb_data$Type)
#check column names - some of them are quite long
names(fb_data)
#we need to rename some of the columns to make the names shorter
fb_data <- rename(fb_data,
                  page_likes = Page.total.likes,
                  reach = Lifetime.Post.Total.Reach,
                  impressions = Lifetime.Post.Total.Impressions,
                  engaged_users = Lifetime.Engaged.Users,
                  consumers = Lifetime.Post.Consumers,
                  consumptions = Lifetime.Post.Consumptions,
                  liked_impr = Lifetime.Post.Impressions.by.people.who.have.liked.your.Page,
                  liked_reach = Lifetime.Post.reach.by.people.who.like.your.Page,
                  liked_engaged = Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post,
                  interactions = Total.Interactions)
#check results
names(fb_data)
########## Any Missing Data #############
## Show cases with missing values
fb_data[!complete.cases(fb_data),] #we get 5 rows that have missing data
#using aggr function from VIM packages --missing values are in red
aggr(fb_data, prop = F, numbers = T)
#Remove the missing rows????
# 5 rows won't impact our results that much, otherwise we could impute
dim(fb_data) #500 rows of data
#Uncomment below to remove the 5 rows
#fb_data <- fb_data[complete.cases(fb_data),]
#dim(fb_data) #If you run this code, you will have 495 rows of data
#Imputation method - we will just impute with 0 instead
fb_data[is.na(fb_data)] = 0
#Check results
dim(fb_data) #500 rows of data so we have successfully imputed
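#If zero-filling feels too crude (it drags means toward 0), a per-column
#median imputation is a common alternative; a minimal sketch on a toy frame
#(synthetic values, not fb_data):

```r
df <- data.frame(a = c(1, NA, 3), b = c(10, 20, NA))
# Replace each column's NAs with that column's median of the observed values
df[] <- lapply(df, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})
df  # a: 1, 2, 3 ; b: 10, 20, 15
```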
###### Explore Data: Data visualization ##############
##### Univariate Analysis: Barplots for important categorical variables (x) ######
# Don't map a variable to y
bar1 <- ggplot(fb_data, aes(x=factor(Type)))+
geom_bar(stat="count", width=0.7, fill="sienna1")+
theme_minimal()
#Significantly more photo posts than other types
bar2 <- ggplot(fb_data, aes(x=factor(Category)))+
geom_bar(stat="count", width=0.7, fill="steelblue")+
theme_minimal()
#More category 1 posts, then 3, then 2
bar3 <- ggplot(fb_data, aes(x=factor(Post.Month)))+
geom_bar(stat="count", width=0.7, fill="salmon4")+
theme_minimal()
# More posts in October
bar4 <- ggplot(fb_data, aes(x=factor(Post.Weekday)))+
geom_bar(stat="count", width=0.7, fill="red4")+
theme_minimal()
# Saturday and Sunday are when the most posts happen
plot_grid(bar1, bar2, bar3, bar4, labels = "AUTO")
#Key insights:
# 1. Significantly more photo posts than other types
# 2. More category 1 posts, then 3, then 2
# 3. More posts in October
# 4. Saturday and Sunday are when the most posts happen
#####Univariate Analysis: Boxplots for important independent numeric variables ######
# page_likes on y axis
ggbi1 <- ggplot(data = fb_data, aes(x = Type, y = page_likes, fill = Type))+geom_boxplot() + stat_summary(fun=mean, geom="point", shape=20, size=8, color="red", fill="red")+theme_minimal()
ggbi2 <- ggplot(data = fb_data, aes(x = factor(Category), y = page_likes, fill = Category))+geom_boxplot()+theme_minimal()
ggbi3 <- ggplot(data = fb_data, aes(x = factor(Post.Month), y = page_likes))+geom_boxplot()+theme_minimal()
ggbi4 <- ggplot(data = fb_data, aes(x = factor(Post.Weekday), y = page_likes))+geom_boxplot()+theme_minimal()
ggbi5 <- ggplot(data = fb_data, aes(x = factor(Post.Hour), y = page_likes))+geom_boxplot()+theme_minimal()
ggbi6 <- ggplot(data = fb_data, aes(x = factor(Paid), y = page_likes, fill = factor(Paid)))+geom_boxplot()+theme_minimal()
plot_grid(ggbi1, ggbi2, ggbi3, ggbi4, labels = "AUTO")
plot_grid(ggbi5, ggbi6, labels = "AUTO")
# Key insights on Page Likes:
# 1. Video and status posts get more page likes than links and photos.
# 2. Page likes have been increasing over the year, which is a good sign.
# 3. Page likes are consistent across days of the week, with more variability on days 5 & 6.
# 4. The posting hours are quite strange: lots of posting happens in the early
# morning hours. This insight may not be actionable from a marketing standpoint,
# as posting is most likely automated.
# 5. No difference in page_likes between paid and unpaid segments.
###### Univariate Analysis: Histogram plots for important independent numeric variables##
ggplot(data=fb_data, aes(x=reach)) + geom_histogram()
hist1 <- ggplot(data = fb_data, aes(x = reach))+
geom_histogram(fill = "lightblue", binwidth = 25, colour = "black") +
geom_vline(aes(xintercept = median(reach)), linetype = "dashed")
hist2 <- ggplot(data = fb_data, aes(x = impressions))+
geom_histogram(fill = "blue", colour = "black") +
geom_vline(aes(xintercept = median(impressions)), linetype = "dashed")
hist3 <- ggplot(data = fb_data, aes(x = engaged_users))+
geom_histogram(fill = "red", binwidth = 3, colour = "black") +
geom_vline(aes(xintercept = median(engaged_users)), linetype = "dashed")
plot_grid(hist1, hist2, hist3, labels = "AUTO")
# The core FB visibility metrics (reach, impressions, engaged_users) are all
# heavily right-skewed: most posts sit near the median, with a long
# tail of high-performing outliers.
##########Bivariate analysis: Boxplots for TYPE (as x variable) vs important numeric (y variables) ##########
bi1 <- ggplot(data = fb_data, aes(x = Type, y = page_likes, fill = Type))+geom_boxplot()
bi2 <- ggplot(data = fb_data, aes(x = Type, y = reach, fill = Type))+geom_boxplot()
bi3 <- ggplot(data = fb_data, aes(x = Type, y = impressions, fill = Type))+geom_boxplot()
bi4 <- ggplot(data = fb_data, aes(x = Type, y = engaged_users, fill = Type))+geom_boxplot()
plot_grid(bi1, bi2, bi3, bi4, labels = "AUTO")
bi5 <- ggplot(data = fb_data, aes(x = Type, y = consumers, fill = Type))+geom_boxplot()
bi6 <- ggplot(data = fb_data, aes(x = Type, y = consumptions, fill = Type))+geom_boxplot()
bi7 <- ggplot(data = fb_data, aes(x = Type, y = liked_impr, fill = Type))+geom_boxplot()
bi8 <- ggplot(data = fb_data, aes(x = Type, y = liked_reach, fill = Type))+geom_boxplot()
bi9 <- ggplot(data = fb_data, aes(x = Type, y = liked_engaged, fill = Type))+geom_boxplot()
plot_grid(bi5, bi6, bi7, bi8, bi9, labels = "AUTO")
bi10 <- ggplot(data = fb_data, aes(x = Type, y = comment, fill = Type))+geom_boxplot()
bi11 <- ggplot(data = fb_data, aes(x = Type, y = like, fill = Type))+geom_boxplot()
bi12 <- ggplot(data = fb_data, aes(x = Type, y = share, fill = Type))+geom_boxplot()
bi13 <- ggplot(data = fb_data, aes(x = Type, y = interactions, fill = Type))+geom_boxplot()
plot_grid(bi10, bi11, bi12, bi13, labels = "AUTO")
##########Bivariate analysis: Boxplots for PAID (as x variable) vs important numeric (y variables) ##########
bipd1 <- ggplot(data = fb_data, aes(x = factor(Paid), y = page_likes, fill = factor(Paid)))+geom_boxplot()
bipd2 <- ggplot(data = fb_data, aes(x = factor(Paid), y = reach, fill = factor(Paid)))+geom_boxplot()
bipd3 <- ggplot(data = fb_data, aes(x = factor(Paid), y = impressions, fill = factor(Paid)))+geom_boxplot()
bipd4 <- ggplot(data = fb_data, aes(x = factor(Paid), y = engaged_users, fill = factor(Paid)))+geom_boxplot()
plot_grid(bipd1, bipd2, bipd3, bipd4, labels = "AUTO")
#very little difference between paid and unpaid
bipd5 <- ggplot(data = fb_data, aes(x = factor(Paid), y = consumers, fill = factor(Paid)))+geom_boxplot()
bipd6 <- ggplot(data = fb_data, aes(x = factor(Paid), y = consumptions, fill = factor(Paid)))+geom_boxplot()
bipd7 <- ggplot(data = fb_data, aes(x = factor(Paid), y = liked_impr, fill = factor(Paid)))+geom_boxplot()
bipd8 <- ggplot(data = fb_data, aes(x = factor(Paid), y = liked_reach, fill = factor(Paid)))+geom_boxplot()
bipd9 <- ggplot(data = fb_data, aes(x = factor(Paid), y = liked_engaged, fill = factor(Paid)))+geom_boxplot()
plot_grid(bipd5, bipd6, bipd7, bipd8, bipd9, labels = "AUTO")
bipd10 <- ggplot(data = fb_data, aes(x = factor(Paid), y = comment, fill = factor(Paid)))+geom_boxplot()
bipd11 <- ggplot(data = fb_data, aes(x = factor(Paid), y = like, fill = factor(Paid)))+geom_boxplot()
bipd12 <- ggplot(data = fb_data, aes(x = factor(Paid), y = share, fill = factor(Paid)))+geom_boxplot()
bipd13 <- ggplot(data = fb_data, aes(x = factor(Paid), y = interactions, fill = factor(Paid)))+geom_boxplot()
plot_grid(bipd10, bipd11, bipd12, bipd13, labels = "AUTO")
############### 2-Way Cross Tabulation Tables to compare categorical variables ################
#Type
CrossTable(fb_data$Post.Month, fb_data$Type, digits=2, prop.c = TRUE,
prop.r = TRUE, prop.t = FALSE, chisq = FALSE, format = "SAS", expected = FALSE)
#85% of posts were photos - by far, followed by status posts
#Oct & July are the busiest months when it comes to posts
CrossTable(fb_data$Post.Weekday, fb_data$Type, digits=2, prop.c = TRUE,
prop.r = TRUE, prop.t = FALSE, chisq = FALSE, format = "SAS", expected = FALSE)
#same as above just flipped
CrossTable(fb_data$Type, fb_data$Post.Weekday, digits=2, prop.c = TRUE,
prop.r = TRUE, prop.t = FALSE, chisq = FALSE, format = "SAS", expected = FALSE)
#32% of photos are posted on the weekends
CrossTable(fb_data$Paid, fb_data$Type, digits=2, prop.c = TRUE,
prop.r = TRUE, prop.t = FALSE, chisq = FALSE, format = "SAS", expected = FALSE)
CrossTable(fb_data$Paid, fb_data$Category, digits=2, prop.c = TRUE,
prop.r = TRUE, prop.t = FALSE, chisq = FALSE, format = "SAS", expected = FALSE)
############Exploring Associations between continuous variables with scatterplots####
ggplot(fb_data, aes(page_likes, reach)) +
geom_point()+
geom_smooth(method=lm)
ggplot(fb_data, aes(page_likes, impressions)) +
geom_point()+
geom_smooth(method=lm)
ggplot(fb_data, aes(comment, page_likes)) +
geom_point()+
geom_smooth(method=lm)
######### Explore Data: Correlation Matrix & Pairwise Scatterplot ##############
#subset dataframe to get just all numeric variables
num_vars<- fb_data %>%
select_if(is.numeric)
head(num_vars)
# Build Correlation Matrix structure:
num_vars %>%
cor() %>%
corrplot(type = "upper", insig = "blank", diag = FALSE, addCoef.col = "grey")
# create a pairwise scatterplot - too many variables to plot at once
# fb_data%>%
# ggpairs()
cor(num_vars)
##Insights from correlation matrix
# -lots of extreme values for certain variables
# -highly correlated variables - share, likes, interactions
#Establish a baseline with linear regression
linmod <- lm(page_likes ~ ., data = fb_data)
summary(linmod)
#1 coefficient not defined because of singularities
#We need to address collinearity. Many significant variables.
alias(lm(page_likes ~ ., data = fb_data))
#interactions is highly correlated
#remove interactions from the model
linmodnoInt <- lm(page_likes ~ . -interactions, data = fb_data)
summary(linmodnoInt)
#Let's calculate the VIF
vif(linmodnoInt)
#very high VIF's for:
# reach
# impressions
# engaged_users
# consumers
# liked_imp
# liked_reach
# liked_engaged
# commment - slightly high, but okay
# like
# share
linmodnocol <- lm(page_likes ~ Type + Category + Post.Month +
Post.Weekday + Post.Hour + Paid + consumptions, data = fb_data)
summary(linmodnocol)
#Let's calculate the VIF
vif(linmodnocol) # - much better no collinear variables
####################### Target variable for modeling ####################
#What is the question we're trying to answer?:
#----Have we successfully built our brand on Facebook?
# What is our target variable for our modeling? Which KPI's are we trying to explain?
# - we will use page likes as our target kpi >> engagement with brand
# - we will also use interactions as our 2ndary kpi >> customer behavior
# - will use reach as a 3rd kpi
#try a new baseline model with the addt'l variables we removed earlier
#we want to add reach and interactions
linmodImproved <- lm(page_likes ~ Type + Category + Post.Month +
Post.Weekday + Post.Hour + Paid + consumptions + reach + interactions, data = fb_data)
summary(linmodImproved)
#Let's calculate the VIF
vif(linmodImproved) # - much better no collinear variables
########### ML Method 1: Hclust ####################
########### Step 1: Transform data for Hclust model####################
#We cannot assume that factor variables are irrelevant
#to our cluster definitions; it is better to use all the data.
#The daisy function in the cluster package handles mixed
#data types by rescaling the values; with factors present it
#uses Gower's distance rather than plain Euclidean distance
dataset <- fb_data %>%
select(Type, Category, Post.Month,
Post.Weekday, Post.Hour, Paid,
reach, impressions, engaged_users,
consumers, comment, like, share)
str(dataset)
#Get data into the right format if needed - We need to fix some variables
dataset$Type <- as.factor(dataset$Type)
dataset$Category <- as.factor(dataset$Category)
dataset$Post.Month <- as.factor(dataset$Post.Month)
dataset$Post.Weekday <- as.factor(dataset$Post.Weekday)
dataset$Post.Hour <- as.factor(dataset$Post.Hour)
dataset$Paid <- as.factor(dataset$Paid)
str(dataset)
#scale data using daisy function
fb_data_dist <- daisy(dataset) #works with mixed data types if we wanted to keep categorical variables in the dataset
#select first few rows/columns to check if scaling worked
as.matrix(fb_data_dist) [1:5, 1:5]
#the distances look reasonable (zeroes on the diagonal, symmetric, scaled [0,1]),
#so we proceed to the hierarchical clustering method itself, invoking hclust on the
#dissimilarity matrix
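#What daisy() computes for mixed data is Gower's distance: per-variable
#dissimilarities (range-scaled for numerics, 0/1 for factors), averaged.
#A hand-worked sketch on two toy observations (hypothetical values):

```r
# One numeric feature (range 0-100) and one categorical feature
num1 <- 20; num2 <- 70; rng <- 100      # numeric part: |20 - 70| / 100 = 0.5
type1 <- "Photo"; type2 <- "Video"      # factor part: unequal levels -> 1
gower <- mean(c(abs(num1 - num2) / rng, as.numeric(type1 != type2)))
gower  # (0.5 + 1) / 2 = 0.75
```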
########### Step 2: Apply Hclust Algorithm -####################
fb_data_hc <- hclust(fb_data_dist, method = "complete")
#draw cluster object
plot(fb_data_hc)
# using dendrogram objects
hcd <- as.dendrogram(fb_data_hc)
#str(hcd) #check results
# alternative way to get a dendrogram
plot(hcd)
#The dendrogram is hard to read, but there are large clusters up top
# and lots of smaller, thinner clusters towards the bottom.
#It will be a bit hard to get the right cluster assignments.
##### Step 3: Hclust Goodness-of-fit #########
#Finally we check for goodness of fit for a hierarchical clustering solution
#method 1: the cophenetic correlation coefficient (CPCC), which assesses how well
#a dendrogram matches the true distance metric.
#Use cophenetic() to get the distances implied by the dendrogram and compare them
#to the dist() metrics with cor()
cor(cophenetic(fb_data_hc), fb_data_dist)
# CPCC is interpreted similarly to Pearson's r: a CPCC > 0.6 indicates
# a relatively strong fit, meaning the hierarchical tree represents the distances
# between observations well (0.60-0.80 is a strong relationship).
# Cor = 0.5381611
# In our case the relationship isn't that strong;
# this suggests a medium fit.
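#For contrast, here is what a strong CPCC looks like: a minimal sketch on
#synthetic, well-separated data (toy blobs, not our dataset):

```r
set.seed(1)
# Two well-separated blobs of 10 points each
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
d  <- dist(pts)
hc <- hclust(d, method = "complete")
cpcc <- cor(cophenetic(hc), d)
cpcc  # close to 1: the tree reproduces the distances well
```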
#We can see where the dendrogram would be cut
#by overlaying its plot with rect.hclust(), specifying
#the number of groups we want (k=...)
#the dendrogram suggests possibly 5 clusters,
#albeit this is really open to interpretation
plot(fb_data_hc)
rect.hclust(fb_data_hc, k=5, border="red")
#We obtain the assignment vector for observations using cutree()
#Basically, this means we need to get the cluster assignments
fb_data_hc_segment <- cutree(fb_data_hc, k=5)
table(fb_data_hc_segment) #check assignments
#Group 1 dominates the assignment, followed by
#groups 4 and 5, 3, & 2. The clusters are not well-balanced;
#group 2 is the smallest cluster at 38 observations.
#Let's try to create larger segments k = 8
plot(fb_data_hc)
rect.hclust(fb_data_hc, k=8, border="blue")
fb_data_hc_segment <- cutree(fb_data_hc, k=8)
table(fb_data_hc_segment) #check assignments
#Let's try to create larger segments k= 10
plot(fb_data_hc)
rect.hclust(fb_data_hc, k=10, border="green")
fb_data_hc_segment <- cutree(fb_data_hc, k=10)
table(fb_data_hc_segment) #check assignments
#We will be able to answer the following questions:
# 1-Are there obvious differences in group means?
# - Hypothesis testing to help us understand the segments
# 2-Are the groups differentiated enough to tell us a story?
# 3-Do we see immediately odd results (e.g. a mean equal to the value of
# one data level?)
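#The "differences in group means" question can be checked formally with a
#one-way ANOVA of a metric against the segment labels; a minimal sketch on
#synthetic segment data (not our clusters):

```r
set.seed(42)
# Two synthetic segments whose metric means clearly differ
metric  <- c(rnorm(30, mean = 100, sd = 10), rnorm(30, mean = 140, sd = 10))
segment <- factor(rep(c(1, 2), each = 30))
fit <- aov(metric ~ segment)
summary(fit)  # a small p-value indicates the segment means differ
```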
#check results of cluster assignments against dataset using custom function
hclust_segs <-seg_summ_function(fb_data, fb_data_hc_segment)
hclust_segs
#Type variable (as recoded in the cluster summary)
#1 - Link
#2 - Photo
#3 - Status
#4 - Video
table(fb_data$Paid)
#0-unpaid
#1-paid
table(fb_data$Type) #how was type
#variable recoded in clust algorithm
#1 - link
#2 - Photo
#3 - Status
#4 - Video
#Insights based on hclust ML algorithm:
# The questions we are trying to answer include:
# -What posts are most likely to be engaging?
# -What are the characteristics of an engaging post?
# - Are there any characteristics we can use to gain reach & therefore build our brand?
#Group 4 stands out as having the most reach:
#a type 2 (photo) segment with the most reach,
#average interactions, and next-to-highest
#consumptions, and it is an unpaid segment.
#If we care about audience reach for very little
#money, this segment should take care of that.
# Group 1 dominates in number of observations. It's an
# unpaid segment with the most average page likes, and its type is photo.
# This indicates that posting photos goes a long way toward
# engaging users with our FB page.
# Group 3 stands out immediately as having
#the most interactions, but it's not really
#actionable because of the small number of observations.
#All segments indicate that the day of the week
#a post lands on is important, and that
#photos dominate the type of post that helps
#us reach our branding goals.
######### ML Method 2: Mean-Based Clustering: kmeans() ############
#Only appropriate for numeric data or data that can be reasonably coerced to numeric.
#Unlike higher-order categorical variables, binary factors can be coerced to numeric
#with no alteration of meaning.
#Although it's not optimal to cluster binary/categorical values
#with k-means, we will attempt it.
#We just need to fix the type category
#1 - Link
#2 - Photo
#3 - Status
#4 - Video
fb_data_num <- dataset %>%
  mutate(type = case_when(
    Type == "Link" ~ 1,
    Type == "Photo" ~ 2,
    Type == "Status" ~ 3,
    Type == "Video" ~ 4,
    TRUE ~ 5
  ))
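#For reference, a base-R match() recode is equivalent to the case_when above;
#a sketch on a toy vector standing in for fb_data$Type (note: unmatched values
#become NA rather than 5):

```r
type_chr   <- c("Photo", "Link", "Video", "Status", "Photo")
levels_map <- c("Link", "Photo", "Status", "Video")
type_num   <- match(type_chr, levels_map)  # position of each value in levels_map
type_num  # 2 1 4 3 2
```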
#check results of case when statement
str(fb_data_num$type)
table(fb_data_num$type)
#change from numeric to int
fb_data_num$type <- as.integer(fb_data_num$type)
#remove the type variable
fb_data_num <- fb_data_num[,-1]
#check structure
str(fb_data_num)
####### Step 1: Re-scale data and create K means cluster - 5 centers ######
#transformed dataframe from above
str(fb_data_num)
#Now we need to rescale
fb_scaled <- scale(fb_data_num) #scale() standardizes each column; dist() would return a distance matrix, not rescaled data
# Build a kmeans model - start with 5 clusters & nstart = 20
kmeans_model <- kmeans(fb_scaled, centers = 5, nstart = 20)
# Extract the cluster assignment vector from the kmeans model
clust_kmeans_model <- kmeans_model$cluster
# Create a new data frame appending the cluster assignment
fb_data_num2 <- mutate(fb_data_num, cluster = clust_kmeans_model)
# Calculate the size of each cluster
count(fb_data_num2, cluster)
#Cluster 3 has only 11 obs; 5 was probably too many clusters to begin with
# Calculate the mean for each category
fb_data_num2 %>%
group_by(cluster) %>%
summarise_all(list(mean))
#dev.off() --uncomment and run if there is a problem with plotting
# Plot and color main KPIs using their cluster
plot1 <- ggplot(fb_data_num2, aes(x = reach, y = consumptions, color = factor(cluster))) +
geom_point()
plot2 <- ggplot(fb_data_num2, aes(x = engaged_users, y = consumptions, color = factor(cluster))) +
geom_point() #page_likes was dropped from fb_data_num, so plot an included metric instead
plot_grid(plot1, plot2, labels = "AUTO")
######## Step 2: Evaluate K-means cluster with many k's & Elbow plot #########
library(purrr)
# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = fb_scaled, centers = k) #use the same rescaled data as the model above
model$tot.withinss
})
# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
k = 1:10 ,
tot_withinss = tot_withinss
)
head(elbow_df)
# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)
# Based on this plot, the k to choose is 3: the curve flattens
# out by k = 5, but the first pronounced bend is at k = 3.
####### Step 3: Silhouette analysis: observation level performance ####
# • Values close to 1 suggest that the observation is well matched to the assigned cluster
# • Values close to 0 suggest that the observation is borderline matched between two clusters
# • Values close to -1 suggest that the observations may be assigned to the wrong cluster
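#The silhouette of a single observation can be computed by hand from these
#definitions; a sketch with toy distances (not our data):

```r
# a = mean distance to the other points in the observation's own cluster
# b = mean distance to the points in the nearest other cluster
a <- mean(c(1.0, 1.2))        # within-cluster distances (toy values)
b <- mean(c(5.0, 5.5, 6.0))   # distances to the nearest other cluster
s <- (b - a) / max(a, b)      # silhouette: (b - a) / max(a, b)
s  # (5.5 - 1.1) / 5.5 = 0.8 -> well matched to its cluster
```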
# Generate a k-medoids model using the pam() function with k = 3
pam_k3 <- pam(fb_data_num, k = 3)
# Plot the silhouette visual for the pam_k3 model
plot(silhouette(pam_k3))
library(factoextra)
fviz_silhouette(pam_k3) #print avg silhouette in table format
#Group 1 appears to be well matched
#Group 2 is not well-matched
#Group 3 is borderline
#However, the avg silhouette width = 0.54 when k=3
#Let's try again with 5 clusters
# Generate a k-medoids model using the pam() function with k = 5
pam_k5 <- pam(fb_data_num, k = 5)
# Plot the silhouette visual for the pam_k5 model
plot(silhouette(pam_k5))
library(factoextra)
fviz_silhouette(pam_k5) #print avg silhouette in table format
#average silhouette went down but Groups 1,3,5 are now better matched
#2&4 are perhaps borderline
##We continue on....
######### ML Method 3: Model-based Clustering Mclust() ############
#A limitation of k-means is that it requires specifying the number of clusters,
#and it can be difficult to determine whether one solution is better than another.
#There is one algorithm that can itself suggest how many
#clusters are in the data by optimizing the fit of the model: model-based clustering.
#mclust models data with normal (Gaussian) distributions and
#only uses numeric data.
#Our data is not entirely normally distributed, but we
#will attempt this anyway to see what kind of results we get.
fb_mclust_mod <- Mclust(fb_data_num)
summary(fb_mclust_mod)
#This tells us the data has 8 clusters - more than expected!
#We also see log-likelihood information which we can use to
# compare models.
#We try a 5 cluster solution
fb_data_num_mc5 <- Mclust(fb_data_num, G=5)
summary(fb_data_num_mc5)
#Forcing it to find 5 clusters resulted in a similar model
# with slightly lower log-likelihood
#The clusters at first glance look well-situated.
#Comparing models with BIC()
#We compare the original cluster and
#5-cluster models using the Bayesian information criterion
BIC(fb_mclust_mod, fb_data_num_mc5)
#The lower the value of the BIC, the better
#BIC difference: 42952.97 - 44756.53 = -1803.56
42952.97 - 44756.53
#8 clusters seem to work better so that is what we'll go with, even though
#the number of clusters can seem a bit arbitrary.
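#As a sketch of the "lower BIC wins" rule, stats::BIC can compare any two
#models fit to the same data; a toy regression example (synthetic data,
#unrelated to the cluster models above):

```r
set.seed(7)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
m1 <- lm(y ~ x)   # matches the true structure
m2 <- lm(y ~ 1)   # intercept-only baseline
BIC(m1, m2)       # m1 should have the lower (better) BIC
```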
######### ML Method 4: k-means revisited with 8 centers ############
# Build a kmeans model - start with 8 clusters
kmeans_model_8clus <- kmeans(fb_scaled, centers = 8, nstart = 20)
# Extract the cluster assignment vector from the kmeans model
labels_8clus <- kmeans_model_8clus$cluster
# Create a new data frame appending the cluster assignment
fb_data_num_8clus <- mutate(fb_data_num, cluster = labels_8clus)
# Calculate the size of each cluster
count(fb_data_num_8clus, cluster)
# Calculate the mean for each category
fb_data_num_8clus %>%
group_by(cluster) %>%
summarise_all(list(mean))
#We visualize the clusters.
library(cluster)
clusplot(fb_data_num, labels_8clus, color=TRUE, shade=TRUE, labels=4, lines=0, main = "K-means 8-cluster solution") #clusplot needs numeric data, so use fb_data_num
#this shows the observations on a multi-dimensional scaling plot with group membership identified by the
#ellipses.
#Cluster 6 stretches too wide.
######## Final summary #############
# Compare methods - hclust vs. kmeans with 8 clusters
table(fb_data_hc_segment, kmeans_model_8clus$cluster)
#In the 8-segment comparison, our hclust model assigns most of the
#observations to one dominant cluster.
# Although 8 clusters were more statistically sound,
# they didn't lead to better insights than our original
# hclust model. The insights we gathered from the hclust model still hold.
#####ANALYSIS SUMMARY FOR MARKETING STAKEHOLDERS ------------------------------------
# Key insights on Page Likes:
# 1. Video and status posts get more page likes than links and photos.
# 2. Page likes have been increasing over the year, which is a good sign.
# 3. Page likes are consistent across days of the week, with more variability on days 5 & 6.
# 4. The posting hours are quite strange: lots of posting happens in the early
# morning hours. This insight may not be actionable from a marketing standpoint,
# as posting is most likely automated.
# 5. No difference in page_likes between paid and unpaid segments.
#Group 4 stands out as having the most reach:
#a type 2 (photo) segment with the most reach,
#average interactions, and next-to-highest
#consumptions, and it is an unpaid segment.
#If we care about audience reach for very little
#money, this segment should take care of that.
# Group 1 dominates in number of observations. It's an
# unpaid segment with the most average page likes, and its type is photo.
# This indicates that posting photos goes a long way toward
# engaging users with our FB page.
# Group 3 stands out immediately as having
#the most interactions, but it's not really
#actionable because of the small number of observations.
#All segments indicate that the day of the week
#a post lands on is important, and that
#photos dominate the type of post that helps
#us reach our branding goals.
#References:
#[1] Chapman, C. and McDonnell Feit, E. (2015). R for Marketing
#Research and Analytics. Cham: Springer, pp. 47-59, 162-191.