# Clustering
```{r setup}
#| include: false
source("_common.R")
```
A large class of unsupervised learning methods is
collectively called *clustering*. The objective
is to identify groups
of observations that are closer to each other ("clustered") than to those of
the other groups. The data consist of *n* observations $X_1$, $X_2$, ...,
$X_n$, each with *p* features. In general, the number of groups is
unknown and needs to be determined from the data.
In this course we will discuss both model-free
and model-based clustering methods. In the first
group we will present K-means (and other
related methods), and hierarchical (agglomerative) clustering
methods. Model-based clustering assumes that the data
are a random sample, and that the distribution of
the vector of features **X** is a combination of
different distributions (technically: a *mixture model*).
In the latter case, for observations belonging to a
group, the vector **X** is assumed
to have a distribution in a specific (generally parametric) family.
Some of these model-based methods treat the group labels as
missing (unobserved) responses, and rely on the assumed model to
infer those missing labels.
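In symbols, a mixture model assumes that the density of **X** can be written as
$$
f(x) \; = \; \sum_{k=1}^K \pi_k \, f_k(x) \, ,
$$
where the $\pi_k \ge 0$ are the mixing proportions ($\sum_{k=1}^K \pi_k = 1$) and each $f_k$ is the density of one group, usually a member of a parametric family (e.g. Gaussian).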
## K-means, K-means++, K-medoids
Probably the most intuitive and easiest to explain
unsupervised clustering algorithm is K-means (and
its variants *K-means++* and K-medoids, a.k.a. *PAM*,
Partitioning Around Medoids). The specifics of the K-means
algorithm were discussed in class. Here we illustrate its use
on a few examples.
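Although the algorithm itself was covered in class, a minimal sketch of the alternating assignment / centre-update iterations may be a useful reminder. The function below is an illustration only (it ignores empty clusters and other details handled by `kmeans`), and is not meant to replace the built-in implementation:
```{r kmeans.sketch, eval=FALSE}
# A bare-bones sketch of the K-means (Lloyd) iterations, for illustration
# only; it ignores empty clusters and other details handled by kmeans()
my_kmeans <- function(x, k, iter.max = 50) {
  x <- as.matrix(x)
  # start from k randomly chosen observations as centres
  centres <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in 1:iter.max) {
    # assignment step: closest centre (Euclidean distance) for each point
    dists <- as.matrix(dist(rbind(centres, x)))[-(1:k), 1:k]
    labels <- apply(dists, 1, which.min)
    # update step: new centres are the coordinate-wise cluster means
    new.centres <- apply(x, 2, function(col) tapply(col, labels, mean))
    if (max(abs(new.centres - centres)) < 1e-8) break
    centres <- new.centres
  }
  list(cluster = labels, centers = centres)
}
```
On a small simulated data set one can check that this sketch returns partitions comparable to those of `kmeans()` when both runs converge to the same local optimum.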
### UN votes example
These data contain the historical voting patterns
of United Nations members. More details can be
found in [@DVN/LEJUQZ_2009].
The UN was founded in 1945 and currently has 193 member states.
The data include only "important" votes, as classified
by the U.S. State Department. The votes for each country
were coded as follows: Yes (1), Abstain (2), No (3),
Absent (8), Not a Member (9). There were
368 important votes, and 77 countries
voted in at least 95% of these. We focus on these
UN members. Our goal is to explore whether
voting patterns reflect political
alignments, and also whether countries vote along known
political blocs. Our data consist of 77 observations
with 368 variables each. More information on these data can be found
[here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/12379).
The dataset is organized by vote (resolution), one per row,
and its columns contain the corresponding vote of each country
(one country per column).
We first read the data and keep only the resolutions on which
every country voted (i.e. no missing votes):
```{r unvotes, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
X <- read.table(file = "data/unvotes.csv", sep = ",", row.names = 1, header = TRUE)
X2 <- X[complete.cases(X), ]
```
We now compute a K-means partition using the function `kmeans`
with *K = 5*, and look at the resulting groups:
```{r kmeans.un1, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
set.seed(123)
b <- kmeans(t(X2), centers = 5, iter.max = 20, nstart = 1)
table(b$cluster)
```
If we run `kmeans` again, we might get a different partition:
```{r kmeans.un2, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
b <- kmeans(t(X2), centers = 5, iter.max = 20, nstart = 1)
table(b$cluster)
```
It is better to consider a large number of random starts and
take the **best** solution found (what does *best* mean in this
context? In other words, how does the algorithm decide which
solution it should return?)
```{r kmeans.un3, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE}
# Take the best solution out of 1000 random starts
b <- kmeans(t(X2), centers = 5, iter.max = 20, nstart = 1000)
split(colnames(X2), b$cluster)
```
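One way to explore this question empirically (a quick sketch, not the internal details of `kmeans`) is to run the algorithm several times with a single random start each time and compare the values of `tot.withinss`, the total within-cluster sum of squares of each solution:
```{r kmeans.wss, eval=FALSE}
# Run kmeans a few times with a single random start each time and record
# the total within-cluster sum of squares of each solution
set.seed(456)
wss <- replicate(10, kmeans(t(X2), centers = 5, iter.max = 20, nstart = 1)$tot.withinss)
round(wss, 1)
```
Different starts typically end at different local optima, which is why the partition changed between the two runs above.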
It may be better to look at the groups on a map:
```{r kmeans.map1, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE}
library(rworldmap)
library(countrycode)
these <- countrycode(colnames(X2), "country.name", "iso3c")
malDF <- data.frame(country = these, cluster = b$cluster)
# malDF is a data.frame with the ISO3 country names plus a variable to
# merge to the map data
# This line will join your malDF data.frame to the country map data
malMap <- joinCountryData2Map(malDF,
joinCode = "ISO3",
nameJoinColumn = "country"
)
# colors()[grep('blue', colors())]
# fill the space on the graphical device
par(mai = c(0, 0, 0, 0), xaxs = "i", yaxs = "i")
mapCountryData(malMap,
nameColumnToPlot = "cluster", catMethod = "categorical",
missingCountryCol = "white", addLegend = FALSE, mapTitle = "",
colourPalette = c("darkgreen", "hotpink", "tomato", "blueviolet", "yellow"),
oceanCol = "dodgerblue"
)
```
We can compare this partition with the one we obtain using PAM (K-medoids),
which is implemented in the function `pam` of the package `cluster`. Recall
from the discussion in class that `pam` does not need to manipulate the
actual observations, only their pairwise distances (or dissimilarities). In
this case we use Euclidean distances, but it may be interesting to
explore other distances, particularly in light of the categorical
nature of the data (one possibility is sketched below, after the maps). Furthermore, to obtain clusters that may
be easier to interpret we use `K = 3`:
```{r un.pam, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE}
library(cluster)
# Use Euclidean distances
d <- dist(t(X))
# what happens with missing values?
set.seed(123)
a <- pam(d, k = 3)
```
Compare the resulting groups with those of *K-means*:
```{r un.pam2, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE}
b <- kmeans(t(X2), centers = 3, iter.max = 20, nstart = 1000)
table(a$clustering)
table(b$cluster)
```
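Because the cluster labels themselves are arbitrary, a cross-tabulation of the two partitions (a quick sketch; both label vectors are named by country, so we match them explicitly) is usually more informative than the two marginal tables:
```{r un.crosstab, eval=FALSE}
# Cross-tabulate the PAM and K-means labels, matching the label vectors by
# country name; we look for rows/columns that line up, not for equal labels
table(pam = a$clustering, kmeans = b$cluster[names(a$clustering)])
```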
A more interesting visualization uses the map.
We plot the 3 groups found by `pam` on the map, followed by those
found by K-means:
```{r un.pam.map, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE}
these <- countrycode(colnames(X), "country.name", "iso3c")
malDF <- data.frame(country = these, cluster = a$clustering)
malMap <- joinCountryData2Map(malDF, joinCode = "ISO3", nameJoinColumn = "country")
par(mai = c(0, 0, 0, 0), xaxs = "i", yaxs = "i")
mapCountryData(malMap,
nameColumnToPlot = "cluster", catMethod = "categorical",
missingCountryCol = "white", addLegend = FALSE, mapTitle = "",
colourPalette = c("darkgreen", "hotpink", "tomato", "blueviolet", "yellow"), oceanCol = "dodgerblue"
)
these <- countrycode(colnames(X2), "country.name", "iso3c")
malDF <- data.frame(country = these, cluster = b$cluster)
malMap <- joinCountryData2Map(malDF, joinCode = "ISO3", nameJoinColumn = "country")
par(mai = c(0, 0, 0, 0), xaxs = "i", yaxs = "i")
mapCountryData(malMap,
nameColumnToPlot = "cluster", catMethod = "categorical",
missingCountryCol = "white", addLegend = FALSE, mapTitle = "",
colourPalette = c("yellow", "tomato", "blueviolet"),
oceanCol = "dodgerblue"
)
```
What if we use the $L_1$ norm instead?
```{r un.pam.inf, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE}
d <- dist(t(X), method = "manhattan")
set.seed(123)
a <- pam(d, k = 3)
these <- countrycode(colnames(X), "country.name", "iso3c")
malDF <- data.frame(country = these, cluster = a$clustering)
malMap <- joinCountryData2Map(malDF, joinCode = "ISO3", nameJoinColumn = "country")
par(mai = c(0, 0, 0, 0), xaxs = "i", yaxs = "i")
mapCountryData(malMap,
nameColumnToPlot = "cluster", catMethod = "categorical",
missingCountryCol = "white", addLegend = FALSE, mapTitle = "",
colourPalette = c("darkgreen", "hotpink", "tomato", "blueviolet", "yellow"),
oceanCol = "dodgerblue"
)
```
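As suggested earlier, the categorical nature of the votes also invites dissimilarities designed for nominal data. A possible sketch (not pursued further here) treats each resolution as a factor and uses the Gower dissimilarity computed by `cluster::daisy()`:
```{r un.gower, eval=FALSE}
# Treat each resolution as a nominal variable (factor), compute Gower
# dissimilarities between countries, and cluster with PAM as before
Xf <- as.data.frame(lapply(as.data.frame(t(X2)), factor))
rownames(Xf) <- colnames(X2)
dg <- daisy(Xf, metric = "gower")
set.seed(123)
ag <- pam(dg, k = 3)
table(ag$clustering)
```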
As mentioned before, since the data set does not include a *true label*, the
comparison between the different results is somewhat subjective, and it often
relies on the knowledge of subject-matter experts. In our example above, this
would mean asking a political scientist whether these
groupings correspond to known international political blocs or alignments.
### Breweries
In this example beer drinkers were asked to rate 9 breweries on 26
attributes, e.g. whether the brewery has a rich tradition, or
whether it makes very good pilsner beer, etc. For each of these
questions, the *judges* reported a score on a 6-point scale
ranging from 1: "not true at all" to 6: "very true". The data
are in the file `breweries.dat`:
```{r brew, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
x <- read.table("data/breweries.dat", header = FALSE)
x <- t(x)
```
For illustration purposes we use the $L_1$ distance and the
PAM clustering method.
```{r brew2, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
d <- dist(x, method = "manhattan")
set.seed(123)
a <- pam(d, k = 3)
table(a$clustering)
```
To visualize the strength of this cluster partition we use
the silhouette plot discussed in class:
```{r brew.plot, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
plot(a)
```
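The same information is also available numerically, via the `silinfo` component of the object returned by `pam`:
```{r brew.sil, eval=FALSE}
# Average silhouette widths, overall and within each cluster
a$silinfo$avg.width
a$silinfo$clus.avg.widths
```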
Since other distances may produce different partitions, an interesting
exercise would be to compare the above clusters with those found using
the Euclidean or $L_\infty$ norms, for example.
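A possible starting point for that exercise (a sketch only) is to recompute the PAM partitions under the other two distances and cross-tabulate the labels with the $L_1$-based partition:
```{r brew.comp, eval=FALSE}
# Repeat the PAM fit with Euclidean and L-infinity ("maximum") distances
a2 <- pam(dist(x, method = "euclidean"), k = 3)
a3 <- pam(dist(x, method = "maximum"), k = 3)
# Cross-tabulate with the L_1-based partition (labels are arbitrary)
table(L1 = a$clustering, L2 = a2$clustering)
table(L1 = a$clustering, Linf = a3$clustering)
```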
### Cancer example
These data contain gene expression levels for
6830 genes (rows) measured on 64 cell samples (columns).
More information can be found here:
[http://genome-www.stanford.edu/nci60/](http://genome-www.stanford.edu/nci60/).
The data are included in the `ElemStatLearn` package, and also
available on-line:
[https://web.stanford.edu/~hastie/ElemStatLearn/](https://web.stanford.edu/~hastie/ElemStatLearn/).
We will use K-means to identify 8 possible clusters among the 64 cell samples.
As discussed in class this exercise can (perhaps more interestingly) be formulated
in terms of *feature selection*. We load the data and use K-means to find 8 clusters:
```{r nci, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE}
data(nci, package = "ElemStatLearn")
ncit <- t(nci)
set.seed(31)
a <- kmeans(ncit, centers = 8, iter.max = 5000, nstart = 100)
table(a$cluster)
```
Note that in this application we do know the group to which each
observation belongs (its cancer type). We can look at the cancer types that
have been grouped together in each of the 8 clusters:
```{r nci2, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE}
sapply(split(colnames(nci), a$cluster), table)
```
Note that clusters 3, 4, 6 and 7 are dominated by one type of cancer.
Similarly, almost all melanoma and renal samples are in clusters 7 and 8,
respectively, while all CNS samples are in cluster 4. Cluster 5 is
harder to interpret. Although all but one of the ovarian cancer samples are in this
cluster, it also contains 2/3 of the NSCLC samples. It may be of interest
to compare these results with those using different numbers of clusters.
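One possible way to start that comparison (a sketch, using the usual within-cluster sum of squares as a rough guide) is to refit K-means over a grid of values of $K$:
```{r nci.wss, eval=FALSE}
# Total within-cluster sum of squares for a range of K; it always decreases
# as K grows, so we look for an "elbow" rather than a minimum
ks <- 2:12
wss <- sapply(ks, function(k) {
  kmeans(ncit, centers = k, iter.max = 5000, nstart = 50)$tot.withinss
})
plot(ks, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")
```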
<!-- Finally, we will compare this partition (obtained with the usual K-means algorithm -->
<!-- started from `100` random initial configurations) with the one -->
<!-- found by K++ as implemented in `flexclust::kcca`: -->
<!-- ```{r nci1.kpp, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE} -->
<!-- library(flexclust) -->
<!-- set.seed(123) -->
<!-- kpp <- kcca(ncit, k = 8, family = kccaFamily("kmeans")) -->
<!-- table(kpp@cluster) -->
<!-- ``` -->
<!-- while the one obtained with `kmeans` was -->
<!-- ```{r nci2.kpp, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE} -->
<!-- table(a$cluster) -->
<!-- ``` -->
<!-- Note that the two partitions are quite different. Furthermore, note that -->
<!-- there is still a lot of randomness in the solution returned by K++. If we run it again -->
<!-- with a different pseudo-random number generating seed we obtain a very different -->
<!-- partition: -->
<!-- ```{r nci3.kpp, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, tidy=TRUE} -->
<!-- set.seed(987) -->
<!-- table(kcca(ncit, k = 8, family = kccaFamily("kmeans"), control = list(initcent = "kmeanspp"))@cluster) -->
<!-- ``` -->
<!-- A very good exercise would be -->
<!-- to determine which of these different partitions -->
<!-- (the ones found with K++ and the one returned by 100 random starts of -->
<!-- K-means) is better in terms -->
<!-- of the objective function being optimized to find the clusters. -->