Note: the order of the questions might be different for each student. Regression questions are excluded (S23). The answers here should be used as guidelines and you should consult the textbook for comprehensive explanations of each concept.
Written - Clustering
Describe a case where clustering would be an appropriate tool. In your description, be sure to give examples of variables you would have in the data set (and what each row in the data set represents). Also describe what insight it would bring from the data. Your answer should be 2-3 sentences long and in your own words.
Answer
Suppose Netflix collects data about user watch time, where each row represents one user. Variables could include minutes watched per day, number of sessions per week, and number of unique shows watched in a year. A data scientist at Netflix could then use clustering to separate users into groups based on this viewership information, and gain insight into the characteristics of users who tend to watch more or less than others.
Textbook: See the palmerpenguins example. https://datasciencebook.ca/clustering.html#clustering-1
Written - K-means Elbow Plot
K-means clustering was performed on a data set for values of K from 1 to 9. Given the elbow plot below, choose the best K for this data set. Explain your answer in 1-2 sentences.
Answer
The best K seems to be 2. The total WSSD levels off after K = 2, which makes K = 2 the elbow of the plot.
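As a rough illustration of how such an elbow plot can be produced, the sketch below computes the total WSSD for K = 1 to 9. It assumes base R's `kmeans()` and uses the built-in `faithful` data as a stand-in, since the data set behind the plot in the question is not given; the seed and `nstart = 10` are arbitrary choices.

```r
# Sketch only: faithful is an assumed stand-in for the (unknown) exam data set.
set.seed(2023)                      # arbitrary seed, for reproducibility
scaled <- scale(faithful)           # standardize both variables

ks <- 1:9
# Total WSSD of the best of 10 random starts, for each K
total_wssd <- sapply(ks, function(k) {
  kmeans(scaled, centers = k, nstart = 10)$tot.withinss
})

elbow <- data.frame(k = ks, total_wssd = total_wssd)
print(elbow)

# The elbow is the K after which the curve flattens out
plot(elbow$k, elbow$total_wssd, type = "b",
     xlab = "K", ylab = "Total WSSD")
```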
Textbook: https://datasciencebook.ca/clustering.html#choosing-k
Coding - Faithful dataset
Given the faithful data set in R (previewed below), fill in the blanks in the code below to perform K-means clustering with K = 2. Modify the code to ensure that the K-means analysis below is reproducible.
Answer
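The fill-in-the-blank code for this question is not shown here, so the following is only a minimal sketch of one way to do it, assuming base R's `kmeans()`; the textbook section linked below may use a different interface. The key point for reproducibility is calling `set.seed()` before the clustering, because the centroids are initialized randomly. The seed value and the names `faithful_scaled`, `faithful_clust`, and `faithful_clustered` are arbitrary.

```r
# Arbitrary seed; fixing it makes the random centroid initialization reproducible
set.seed(1234)

# Standardize both variables so neither dominates the Euclidean distance
faithful_scaled <- scale(faithful)

# K-means with K = 2; nstart = 10 repeats the random start and keeps the best run
faithful_clust <- kmeans(faithful_scaled, centers = 2, nstart = 10)

# Attach the cluster assignment to each observation
faithful_clustered <- cbind(faithful, cluster = factor(faithful_clust$cluster))
head(faithful_clustered)
```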
Textbook: https://datasciencebook.ca/clustering.html#k-means-in-r
Written - Explain point estimate
In your own words, explain what a point estimate is and why it is useful. Your answer should be 2-3 sentences long.
Answer
A point estimate is a value we compute from a sample that gives us an estimate of a population parameter. If we take multiple samples of the same size from our population and compute a point estimate for each sample, we can obtain a sampling distribution. We can then use the sampling distribution constructed by our estimates to report how confident we are in our estimate.
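To make the distinction between a parameter, a point estimate, and a sampling distribution concrete, here is a small simulation sketch. The population is simulated (hypothetical values, not real data), and the sample size of 40 and the 1000 repetitions are arbitrary choices.

```r
set.seed(100)

# Hypothetical population: 10,000 simulated values (not real data)
population <- rexp(10000, rate = 1 / 30)
pop_mean <- mean(population)          # the parameter (normally unknown)

# One sample of size 40 and the point estimate computed from it
one_sample <- sample(population, size = 40)
point_estimate <- mean(one_sample)

# Many samples of the same size give the sampling distribution of the sample mean
sample_means <- replicate(1000, mean(sample(population, size = 40)))

c(parameter = pop_mean,
  estimate = point_estimate,
  sampling_sd = sd(sample_means))
```

In practice we can rarely draw many samples like this, which is why the simulation is only useful for building intuition.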
Textbook: https://datasciencebook.ca/inference.html#why-do-we-need-sampling
Written - Explain bootstrapping
In your own words, explain the bootstrapping process.
Answer
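No written answer is recorded for this question. As a rough illustration of the usual procedure (not the official answer): treat the single sample as a stand-in for the population, draw a resample of the same size with replacement, compute the statistic of interest, and repeat many times to build a bootstrap distribution. The sample below is simulated and hypothetical, and the 2000 repetitions are an arbitrary choice.

```r
set.seed(42)

# Hypothetical single sample of size 40 (simulated, not real data)
original_sample <- rexp(40, rate = 1 / 30)

# Resample WITH replacement, same size as the original sample, many times,
# computing the statistic of interest (here the mean) each time
boot_means <- replicate(2000, {
  resample <- sample(original_sample,
                     size = length(original_sample),
                     replace = TRUE)
  mean(resample)
})

# The spread of the bootstrap distribution approximates the sampling variability;
# the middle 95% gives a percentile interval for the population mean
boot_interval <- quantile(boot_means, probs = c(0.025, 0.975))
boot_interval
```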
Textbook: https://datasciencebook.ca/inference.html#bootstrapping
Written - Purpose of bootstrapping
In your own words, briefly describe the purpose of bootstrapping in inference.
Answer
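We use bootstrapping because most of the time it is not feasible (too expensive, too resource-intensive) to obtain multiple samples from our population. Instead, we resample with replacement from the single sample we have to approximate the shape and spread of the sampling distribution of our estimate.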
Textbook: https://datasciencebook.ca/inference.html#bootstrapping
Written - Explain WSSD
Describe how to compute the total within-cluster sum of squared distances (total WSSD) in K-means clustering, and what it is used for. Answer in 2-3 sentences in your own words.
Answer
The total WSSD is the sum of the squared distances between each data point and its cluster centroid. In K-means clustering, the WSSD is used to either select the best clustering for a particular value of $k$, or to plot an elbow plot to select the best value of $k$.
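The computation described above can be sketched by hand and checked against the value reported by base R's `kmeans()` (its `tot.withinss` component). The use of the `faithful` data, K = 2, and the seed are assumptions made for illustration only.

```r
set.seed(7)                          # arbitrary seed
scaled <- scale(faithful)
fit <- kmeans(scaled, centers = 2, nstart = 10)

# For each cluster: squared distance of every point to that cluster's centroid, summed
wssd_per_cluster <- sapply(1:2, function(j) {
  pts <- scaled[fit$cluster == j, , drop = FALSE]
  centroid <- colMeans(pts)          # centroid = mean of the cluster's points
  sum(sweep(pts, 2, centroid)^2)     # subtract centroid column-wise, square, sum
})

# Total WSSD = sum of the per-cluster WSSDs
total_wssd <- sum(wssd_per_cluster)
all.equal(total_wssd, fit$tot.withinss)
```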
Textbook: https://datasciencebook.ca/clustering.html#measuring-cluster-quality
Written - Explain K-means
In your own words, describe the K-means clustering algorithm, including all of its major steps.
Answer
K-means is an iterative procedure. First, initialize $k$ random centroids. Assign data points to centroids based on whichever one is closest (by Euclidean distance). Re-compute centroids using the center of each newly assigned cluster of data points. Repeat this process until the centroids do not move anymore.
Due to the random initialization of the $k$ centroids in the first step, we typically repeat this process several times to avoid obtaining a particularly bad clustering by chance.
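The steps above can be sketched in a few lines of R. This is an illustrative implementation, not the one used by `kmeans()` or the textbook: the function name `simple_kmeans` is hypothetical, and initializing the centroids as $k$ randomly chosen data points is just one common choice.

```r
# Hypothetical helper written for illustration only
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: initialize centroids as k randomly chosen data points
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # Step 2: assign each point to its nearest centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    assignment <- max.col(-d, ties.method = "first")
    # Step 3: recompute each centroid as the mean of its assigned points
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }
    # Step 4: stop once the centroids no longer move
    if (all(abs(new_centroids - centroids) < 1e-10)) break
    centroids <- new_centroids
  }
  wssd <- sum((x - centroids[assignment, , drop = FALSE])^2)
  list(cluster = assignment, centers = centroids, tot.withinss = wssd)
}

# Repeat from several random starts and keep the run with the lowest total WSSD
set.seed(2024)                       # arbitrary seed
runs <- replicate(10, simple_kmeans(scale(faithful), k = 2), simplify = FALSE)
best <- runs[[which.min(sapply(runs, function(r) r$tot.withinss))]]
best$tot.withinss
```

Keeping the run with the lowest total WSSD is what guards against an unlucky initialization.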
Textbook: https://datasciencebook.ca/clustering.html#k-means