Today we will dive into unsupervised learning and experiment with clustering. k-means(++) is a popular choice for clustering data (perhaps even the most widely used clustering algorithm). Therefore, we will explore this algorithm a bit more in the following three tasks. In this exercise, we will continue to use `sklearn`, which implements a variety of clustering algorithms in the `sklearn.cluster` package.
The goal of this part of the exercise is to peek under the hood of this algorithm in order to empirically explore its strengths and weaknesses. Initially, we will use synthetic data to develop a basic understanding of the algorithm's performance characteristics.
In the literature on cluster analysis, k-means often refers not only to the clustering algorithm but also to the underlying optimization problem: given a set of points $P$ and a number $k$, find a set $C$ of $k$ centers that minimizes the cost $\sum_{x \in P} \min_{c \in C} \| x - c \|^2$. In cases where k-means refers to the problem formulation, the algorithm itself is sometimes called Lloyd's algorithm. The algorithm repeats two steps until it converges:
- Assign each data point to its nearest cluster center with respect to the squared Euclidean distance.
- Update each cluster center by calculating the mean of the data points assigned to it and using this mean as the new cluster center (see the sketch below).
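To make these two steps concrete, here is a minimal NumPy sketch of the Lloyd iteration loop. The function name and signature are our own, and the sketch assumes that no cluster ever becomes empty:

```python
import numpy as np

def lloyd(points, centers, n_iter=100):
    """Minimal Lloyd's algorithm sketch: points (n, d), initial centers (k, d)."""
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest center (squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of the points assigned to it.
        new_centers = np.array(
            [points[labels == j].mean(axis=0) for j in range(len(centers))]
        )
        if np.allclose(new_centers, centers):  # no center moved: converged
            break
        centers = new_centers
    return centers, labels
```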
This algorithm will always converge and find a solution. However, there is no guarantee that this solution is globally optimal, and the algorithm may converge slowly. Also, the algorithm requires an initial guess for the cluster centers, which is usually obtained by randomly selecting some of the data points as initial centers. Therefore, it is good practice to run the algorithm several times with different initial center guesses to find a solution that is hopefully close to the best one.
Fortunately, the implementation of k-means in `sklearn` takes care of all these details and provides us with a simple interface to control all these aspects.
Navigate to `src/ex1_kmeans.py`. Implement the first part of the `plot_kmeans_clustering` function as follows:
- Load the input data from the given path. You can now run the file and examine the data.
- k-means clustering is scale-sensitive, which means that we generally need to rescale our input data before performing clustering. Note that our `plot_kmeans_clustering` function has a `standardize` parameter that is set to `False` by default. In case `standardize` is set to `True`, standardize the data according to $x_i = \frac{x_i - \mu}{\sigma}$, where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation. `sklearn.preprocessing` may be helpful (see the sketch below).
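One possible way to standardize, sketched under the assumption that the loaded data is a 2D NumPy array which we call `data` here for illustration:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()               # estimates per-feature mean and std
data_std = scaler.fit_transform(data)   # applies (x - mu) / sigma column-wise
```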
Now we want to perform k-means clustering. Implement the `perform_kmeans_clustering` function following these steps:
- Use `sklearn.cluster.KMeans` to train on the given data. Set the parameter `init`, which controls the initialization of the cluster centers, to `"random"`. There is a better way to set this value, but we will discuss that in Task 3.
- Retrieve the cluster centers and predict the cluster index for each point.
- Return the inertia as a float, and the cluster centers and the predicted cluster indices each as an array (a sketch follows after this list).
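A sketch of how `perform_kmeans_clustering` could look; the exact signature in the template may differ, so treat the argument names here as assumptions:

```python
from sklearn.cluster import KMeans

def perform_kmeans_clustering(data, n_clusters):
    """Sketch: cluster `data` and return inertia, centers, and labels."""
    kmeans = KMeans(n_clusters=n_clusters, init="random").fit(data)
    centers = kmeans.cluster_centers_   # array of shape (n_clusters, n_features)
    labels = kmeans.predict(data)       # cluster index for each data point
    return float(kmeans.inertia_), centers, labels
```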
Go back to the `plot_kmeans_clustering` function and finish the remaining TODOs:
- Call the `perform_kmeans_clustering` function three times. Visualize the data points, the cluster centers, and the assignment of data points to cluster centers in a scatter plot for each run. To do this, use a for-loop and `scatter_clusters_2d` to plot all the results in one figure (have a closer look at matplotlib subplots and axes); see the sketch after this list.
- Return the figure object.
- Now have a look at the `main` function. The default number $k$ of clusters is not optimal. Experiment with different values and set the number of k-means clusters you want to use.
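A hypothetical sketch of the plotting loop: we assume here that `scatter_clusters_2d` accepts an axes object together with the data, labels, and centers, so check its actual signature in the template before using this.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax in axes:
    inertia, centers, labels = perform_kmeans_clustering(data, n_clusters)
    scatter_clusters_2d(ax, data, labels, centers)  # hypothetical signature
    ax.set_title(f"inertia = {inertia:.2f}")
```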
Note: Data preprocessing benefits greatly from expert knowledge of the field/application in which the data was measured. Some preprocessing methods may not be applicable in certain settings.
Recall that k-means assigns a given point to the cluster whose center is nearest. This assignment rule partitions the plane into cells, yielding a so-called Voronoi diagram. Each cell in this diagram is the set of points that are closest to one particular center. A Voronoi diagram can be used as a tool to visualize the boundaries of the k-means clusters, but it is also useful for understanding the algorithm itself.
- Navigate to the `plot_decision_boundaries` function and load, preprocess, and cluster the synthetic data using the function `perform_kmeans_clustering` again.
- Use `Voronoi` and `voronoi_plot_2d` from the `scipy.spatial` package to visualize the boundaries of the k-means clusters (see the sketch after this list). Again, use the `ax` object of the plot and `scatter_clusters_2d`.
- Test your code with the test framework of VS Code or by typing `nox -r -s test` in your terminal.
- (Optional) Which assumptions/limitations of the k-means algorithm are illustrated by this visualization?
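A minimal sketch of the Voronoi overlay, assuming `centers` comes from `perform_kmeans_clustering` and `ax` is the axes object of the plot:

```python
from scipy.spatial import Voronoi, voronoi_plot_2d

vor = Voronoi(centers)  # Voronoi diagram induced by the cluster centers
voronoi_plot_2d(vor, ax=ax, show_vertices=False, show_points=False)
```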
A common application of clustering is data compression, where large amounts of data are reduced by extracting and keeping only the "important" aspects (i.e., cluster centers, covariance matrices, and weights). You may want to use this technique if you cannot store or transmit all the measurements. It can also be useful if a tool/function you want to use in your analysis has a runtime that makes it infeasible to apply to thousands or millions of data points. Sometimes, k-means takes a long time to perform the clustering (although you can apply it to large datasets). If you want to speed things up, you can consider using `MiniBatchKMeans`, which performs k-means clustering by drawing multiple random subsets (mini-batches) of the entire dataset.
In this task, the goal is to reduce the storage requirement of an image with width $w$ and height $h$ by shrinking the number of colors used to represent it.
- Open the file `src/ex2_image_compression.py`. The image is loaded in the `main` function using the `load_image` function. Inspect the `input_img` variable and print the information about its dimensions.
Implement the `compress_colorspace` function using the k-means algorithm:
- Reshape the input image into shape $(w \cdot h, 3)$ to perform clustering on colors.
- Use `MiniBatchKMeans` to cluster the image into $k$ clusters.
- Return a compressed image where the number of unique colors has been reduced from $256^3$ to $k$ via k-means clustering. The compressed image must have the same shape as the original one (see the sketch after this list).
- Use `compress_colorspace` in your `main` function to compress the image for $k \in \{2, 8, 64, 256\}$ and plot the results using `imshow`. Set the corresponding value of $k$ as the title for each result.
Test your code with the test framework of vscode or by typing
nox -r -s test
in your terminal.
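One way `compress_colorspace` could be implemented, sketched under the assumption that the image arrives as an array of shape $(h, w, 3)$:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def compress_colorspace(img, n_clusters):
    """Sketch: replace every pixel color by its nearest cluster center."""
    h, w, c = img.shape
    pixels = img.reshape(h * w, c)                       # one row per pixel
    kmeans = MiniBatchKMeans(n_clusters=n_clusters).fit(pixels)
    labels = kmeans.predict(pixels)                      # center index per pixel
    compressed = kmeans.cluster_centers_[labels]         # look up center colors
    return compressed.reshape(h, w, c).astype(img.dtype)
```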
As mentioned above, Lloyd's algorithm requires an initial set of centers. Looking more closely at the sklearn documentation, the second argument, `init`, allows us to use uniformly randomly selected points, something called "k-means++", or a user-defined array as the initial set of centers. The main contribution of k-means++ is a clever strategy for choosing the initial centers:
- Choose a point $x_1 \in P$ uniformly at random, set $C^1 = \{ x_1 \}$.
- for $i = 2$ to $k$ do:
- $\qquad$ Draw a point $x_i \in P$ according to the probability distribution $\Pr[x_i = x] = \frac{D(x)^2}{\sum_{y \in P} D(y)^2}$, where $D(x) = \min_{c \in C^{i-1}} \| x - c \|$ denotes the distance of $x$ to its nearest already-chosen center.
- $\qquad$ Set $C^{i} = C^{i-1} \cup \{ x_i \}$.
- end for
Navigate into `src/ex3_kmeans_plus_plus.py` and have a look at the code.
- Implement the `uniform_sampling` function by drawing points uniformly from the datasets.
- Implement the `d2_sampling` function using the $D^2$ sampling algorithm described above (see the sketch after this list).
- Compare the results on the two datasets by executing the script with `python ./src/ex3_kmeans_plus_plus.py`. What advantages does $D^2$ sampling provide as an initialization? Hint: (Weighted) sampling with and without replacement can be performed using `np.random.choice`.
- Test your code with the test framework of VS Code or by typing `nox -r -s test` in your terminal.
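A sketch of `d2_sampling`, following the pseudocode above and the `np.random.choice` hint; the signature is our own guess:

```python
import numpy as np

def d2_sampling(points, k):
    """Sketch of D^2 sampling: points is an (n, d) array, returns k centers."""
    n = points.shape[0]
    # First center: drawn uniformly at random from the dataset.
    centers = [points[np.random.choice(n)]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest already-chosen center.
        diffs = points[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Draw the next center with probability proportional to D(x)^2.
        centers.append(points[np.random.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```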
Navigate into `src/ex4_gmm.py` and have a look at the code. We create a synthetic dataset with three classes (the same one we used in the lecture) and want to compare k-means clustering with GMMs. If you want, you can use the diabetes dataset from Day 04 instead. Implement the TODOs in the file; a brief sketch of the GMM interface is given below.
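For the comparison, `sklearn.mixture.GaussianMixture` provides the GMM side. A brief sketch, assuming the synthetic data lives in an array called `data` (a name we pick here):

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

kmeans_labels = KMeans(n_clusters=3).fit_predict(data)  # hard assignments
gmm = GaussianMixture(n_components=3).fit(data)
gmm_labels = gmm.predict(data)        # hard assignment (most likely component)
gmm_probs = gmm.predict_proba(data)   # soft assignment: per-component probabilities
```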