diff --git a/.nojekyll b/.nojekyll
index d7bdd33..d1765a9 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-a4dd7030
\ No newline at end of file
+2666adb4
\ No newline at end of file
diff --git a/schedule/index.html b/schedule/index.html
index 922d2fa..71138a5 100644
--- a/schedule/index.html
+++ b/schedule/index.html
@@ -484,22 +484,22 @@
If we knew how to rotate our data, then we could more easily retain the structure.
+
PCA gives us exactly this rotation
+
PCA works when the data can be represented (in a lower dimension) as lines (or planes, or hyperplanes).
+
So, in two dimensions:
+
+
+
+
PCA reduced
+
Here, we can capture a lot of the variation and underlying structure with just 1 dimension,
+
instead of the original 2 (the colouring is for visualizing).
+
+
+
+
PCA bad
+
What about other data structures? Again in two dimensions
+
+
+
+
PCA bad
+
Here, we have failed miserably.
+
There is actually only 1 dimension to this data (imagine walking up the spiral going from purple to yellow).
+
However, when we write it as 1 PCA dimension, the points are all “mixed up”.
+
+
+
+
Explanation
+
+
+
PCA wants to minimize distances (equivalently maximize variance).
+
This means it slices through the data at the meatiest point, and then the next one, and so on.
+
If the data are curved this is going to induce artifacts.
+
PCA also looks at things as being close if they are near each other in a Euclidean sense.
+
On the spiral, our intuition says that things are close only if the distance is constrained to go along the curve.
+
In other words, purple and blue are close, blue and yellow are not.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Kernel PCA
+
Classical PCA comes from \(\X= \U\D\V^{\top}\), the SVD of the (centered) data
+
However, we can just as easily get it from the outer product \(\mathbf{K} = \X\X^{\top} = \U\D^2\U^{\top}\)
+
The intuition behind KPCA is that \(\mathbf{K}\) is an expansion into a kernel space, where \[\mathbf{K}_{i,i'} = k(x_i,\ x_{i'}) = \langle x_i,x_{i'} \rangle\]
+
We saw this trick before with feature expansion.
+
+
+
Procedure
+
+
Specify a kernel function \(k\)
+many people use \(k(x,x') = \exp\left( -d(x, x')/\gamma\right)\) where \(d(x,x') = \norm{x-x'}_2^2\)
+
Form \(\mathbf{K}_{i,i'} = k(x_i,x_{i'})\)
+
Double center \(\mathbf{K} = \mathbf{PKP}\) where \(\mathbf{P} = \mathbf{I}_n - \mathbf{11}^\top / n\)
+
Take eigendecomposition \(\mathbf{K} = \U\D^2\U^{\top}\)
+
+
The scores are still \(\mathbf{Z} = \U_M\D_M\)
+
+
+
+
+
+
+
Note
+
+
+
We don’t explicitly generate the feature map \(\longrightarrow\) there are NO loadings
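A minimal sketch of the procedure above in R (not the course's official code), using the Gaussian kernel from step 1; `X` (an n x p data matrix), `M`, and `gamma` are assumed to already exist:

n <- nrow(X)
K <- exp(-as.matrix(dist(X))^2 / gamma) # K[i, j] = k(x_i, x_j), with d = squared Euclidean distance
P <- diag(n) - matrix(1, n, n) / n      # P = I_n - 11' / n
K <- P %*% K %*% P                      # double centering
e <- eigen(K, symmetric = TRUE)         # K = U D^2 U'
U <- e$vectors[, 1:M]
D <- diag(sqrt(pmax(e$values[1:M], 0))) # guard against tiny negative eigenvalues
kpca_scores <- U %*% D                  # Z = U_M D_M; as noted, there are no loadings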
+
+
+
+
+
+
An alternate view
+
To get the first PC in classical PCA, we solve \[\max_\alpha \Var{\X\alpha} \quad \textrm{ subject to } \quad \left|\left| \alpha \right|\right|_2^2 = 1\]
+
In the kernel setting we solve \[\max_{g \in \mathcal{H}_k} \Var{g(X)} \quad \textrm{ subject to } \quad\left|\left| g \right|\right|_{\mathcal{H}_k} = 1\]
+
Here \(\mathcal{H}_k\) is a function space determined by \(k(x,x')\).
+
+
+
Example kernels
+
+
\(k(x,x') = x^\top x'\)
+
+gives back regular PCA
+
+
\(k(x,x') = (1+x^\top x')^d\)
+
+gives a function space which contains all \(d^{th}\)-order polynomials.
+
+
\(k(x,x') = \exp(-\norm{x-x'}_2^2/\gamma)\)
+
+gives a function space spanned by the infinite Fourier basis
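For concreteness, here are these kernels written as plain R functions, a sketch you could plug into the procedure above (the parameter defaults are arbitrary choices, not prescriptions):

k_linear <- function(x, xp) sum(x * xp)                              # recovers regular PCA
k_poly   <- function(x, xp, d = 2) (1 + sum(x * xp))^d               # all d-th order polynomials
k_gauss  <- function(x, xp, gamma = 1) exp(-sum((x - xp)^2) / gamma) # infinite Fourier basis
# form the kernel matrix for any of them, e.g. K[i, j] = k_poly(X[i, ], X[j, ])
K <- outer(seq_len(nrow(X)), seq_len(nrow(X)),
           Vectorize(function(i, j) k_poly(X[i, ], X[j, ])))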
+
s <- svd(X) # use svd
pca_loadings <- s$v[, 1:M]
pca_scores <- X %*% pca_loadings
+
+
+
s <- eigen(t(X) %*% X) # V D^2 V'
pca_loadings <- s$vectors[, 1:M]
pca_scores <- X %*% pca_loadings
+
+
+
s <- eigen(X %*% t(X)) # U D^2 U'
D <- sqrt(diag(s$values[1:M]))
U <- s$vectors[, 1:M]
pca_scores <- U %*% D
pca_loadings <- solve(D) %*% t(U) %*% X # = D^{-1} U' X = t(V); elementwise 1 / D would put Inf off the diagonal
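A quick sanity check that the three routes agree, up to the usual per-column sign flips (a sketch; `X` and `M` are assumed to be defined as above):

s_svd <- svd(X)
s_vtv <- eigen(crossprod(X))  # t(X) %*% X = V D^2 V'
s_xxt <- eigen(tcrossprod(X)) # X %*% t(X) = U D^2 U'
scores_svd <- X %*% s_svd$v[, 1:M]
scores_vtv <- X %*% s_vtv$vectors[, 1:M]
scores_xxt <- s_xxt$vectors[, 1:M] %*% diag(sqrt(s_xxt$values[1:M]))
max(abs(abs(scores_svd) - abs(scores_vtv))) # ~ 0
max(abs(abs(scores_svd) - abs(scores_xxt))) # ~ 0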
+
+
+
+
KPCA:
+
+
n <- nrow(X)
P <- diag(n) - matrix(1, n, n) / n # double-centering matrix, P = I_n - 11' / n
d <- 2
K <- P %*% (1 + X %*% t(X))^d %*% P # polynomial kernel
e <- eigen(K) # U D^2 U'
# (different from the PCA one, K != XX')
U <- e$vectors[, 1:M]
D <- diag(sqrt(e$values[1:M]))
kpca_poly <- U %*% D
+
+
+
K <- P %*% tanh(1 + X %*% t(X)) %*% P # sigmoid kernel
e <- eigen(K) # U D^2 U'
# (different from the PCA one, K != XX')
U <- e$vectors[, 1:M]
D <- diag(sqrt(e$values[1:M]))
kpca_sigmoid <- U %*% D
+
+
+
+
+
+
Plotting
+
+
+
+
PCA loadings
+
Showing the first 10 PCA loadings:
+
+
The first column gives the weights on the first score
+
each number corresponds to a variable in the original data
+
How much does that variable contribute to that score?
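A small sketch for answering that question, assuming `pca_loadings` carries the original variable names as row names:

# variables with the largest absolute weight on the first score
head(sort(abs(pca_loadings[, 1]), decreasing = TRUE), 5)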
Clustering
So far, we’ve looked at ways of reducing the dimension.
+
Either linearly or nonlinearly.
+
+
The goal is visualization/exploration or possibly for an input to supervised learning.
+
+
Now we try to find groups or clusters in our data.
+
Think of clustering as classification without the labels.
+
+
+
K-means (ideally)
+
+
Select a number of clusters \(K\).
+
Let \(C_1,\ldots,C_K\) partition \(\{1,2,3,\ldots,n\}\) such that
+
+
All observations belong to some set \(C_k\).
+
No observation belongs to more than one set.
+
+
Make within-cluster variation, \(W(C_k)\), as small as possible. \[\min_{C_1,\ldots,C_K} \sum_{k=1}^K W(C_k).\]
+
Define \(W\) as \[W(C_k) = \frac{1}{2|C_k|} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2.\] That is, (proportional to) the average squared Euclidean distance between all pairs of cluster members.
+
+
+
To work, K-means needs a notion of a center and a distance to that center
+
+
+
+
Why this formula?
+
Let \(\overline{x}_k = \frac{1}{|C_k|} \sum_{i\in C_k} x_i\)
If you wanted (equivalently) to minimize \(\sum_{k=1}^K \frac{1}{|C_k|} \sum_{x \in C_k} \norm{x-\overline{x}_k}^2_2\), then you’d use \(\sum_{k=1}^K \frac{1}{\binom{|C_k|}{2}} \sum_{i, i' \in C_k} \norm{x_i - x_{i'}}_2^2\)
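The point of this slide is that \(W(C_k)\) as defined above equals \(\sum_{i \in C_k} \norm{x_i - \overline{x}_k}_2^2\). A quick numerical check on a made-up cluster (the matrix `xk` is purely illustrative):

xk <- matrix(rnorm(20 * 3), 20, 3)                     # 20 points in one cluster, p = 3
pairwise <- sum(as.matrix(dist(xk))^2) / (2 * nrow(xk)) # (1 / 2|C_k|) sum_{i,i'} ||x_i - x_i'||^2
centered <- sum(scale(xk, scale = FALSE)^2)             # sum_i ||x_i - xbar_k||^2
all.equal(pairwise, centered)                           # TRUE, up to floating point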
+
+
+
K-means (in reality)
+
This is too computationally challenging (\(K^n\) partitions!) \[\min_{C_1,\ldots,C_K} \sum_{k=1}^K W(C_k).\] So, we make a greedy approximation:
+
+
Randomly assign observations to the \(K\) clusters
+
Iterate the following:
+
+
For each cluster, compute the \(p\)-length vector of the means in that cluster.
+
Assign each observation to the cluster whose centroid is closest (in Euclidean distance).
+
+
+
This procedure is guaranteed to decrease \(\sum_{k=1}^K W(C_k)\) at each step.
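A bare-bones sketch of this greedy procedure (Lloyd-style iterations), for intuition only; in practice you would call `kmeans()`. It ignores the empty-cluster edge case that `kmeans()` handles:

simple_kmeans <- function(X, K, maxit = 50) {
  n <- nrow(X)
  cluster <- sample(rep_len(1:K, n)) # 1. random initial assignment
  for (it in 1:maxit) {
    # 2a. K x p matrix of cluster means
    centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
    # 2b. n x K matrix of Euclidean distances from each point to each centroid
    d2 <- as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K]
    new_cluster <- apply(d2, 1, which.min) # reassign each point to the closest centroid
    if (all(new_cluster == cluster)) break # stop once assignments stabilize
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers)
}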
+
+
+
Best practices
+
To fit K-means, you need to
+
+
Pick \(K\) (inherent in the method)
+
Convince yourself you have found a good solution (due to the randomized / greedy algorithm).
+
+
For 2., run K-means many times with different starting points. Pick the solution that has the smallest value for \[\sum_{k=1}^K W(C_k)\]
+
It turns out that 1. is difficult to do in a principled way.
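A sketch of point 2. with a generic data matrix `X` (assumed numeric); in practice `kmeans()`'s `nstart` argument does the restarts for you:

fits <- lapply(1:25, function(i) kmeans(X, centers = 3))
best <- fits[[which.min(sapply(fits, `[[`, "tot.withinss"))]] # smallest sum_k W(C_k)
# equivalently, in one call:
best2 <- kmeans(X, centers = 3, nstart = 25)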
+
+
+
Choosing the Number of Clusters
+
Why is it important?
+
+
It might make a big difference (concluding there are \(K = 2\) cancer sub-types versus \(K = 3\)).
+
One of the major goals of statistical learning is automatic inference. A good way of choosing \(K\) is certainly a part of this.
Final clustering assignments depend on the chosen initial cluster centers.
+
+
Hierarchical clustering
+
+
No need to choose the number of clusters beforehand.
+
There is no random component (nor choice of starting point).
+
+
There is a catch: we need to choose a way to measure the distance between clusters, called the linkage.
+
+
+
Same data as the K-means example:
+
+
+Code
+
# same data as the K-means "Dumb example"
heatmaply::ggheatmap(
  as.matrix(dist(rbind(X1, X2, X3))),
  showticklabels = c(FALSE, FALSE), hide_colorbar = TRUE
)
+
+
+
+
+
+
+
+
+
+
+
+
Hierarchical clustering
+
+
+
+
+
+
+
+
+
+
+
+
+
Given the linkage, hierarchical clustering produces a sequence of clustering assignments.
+
At one end, all points are in their own cluster.
+
At the other, all points are in one cluster.
+
In the middle, there are nontrivial solutions.
+
+
+
+
+
+
Agglomeration
+
+
+
+
+
+
+
+
+
+
+
Given these data points, an agglomerative algorithm chooses a cluster sequence by combining the points into groups.
+
We can also represent the sequence of clustering assignments as a dendrogram
+
Cutting the dendrogram horizontally partitions the data points into clusters
+
+
+
+
+
Notation: Define \(x_1,\ldots, x_n\) to be the data
+
Let \(d_{ij}\) denote the dissimilarity between each pair \(x_i, x_j\)
+
At any level, clustering assignments can be expressed by sets \(G = \{ i_1, i_2, \ldots, i_r\}\) giving the indices of points in this group. Define \(|G|\) to be the size of \(G\).
+
+
+
Linkage
+
+The linkage is a function \(d(G,H)\) that takes two groups \(G,\ H\) and returns the linkage distance between them.
+
+
+
+
+
+
+
Agglomerative clustering, given the linkage
+
+
Start with each point in its own group
+
Until there is only one cluster, repeatedly merge the two groups \(G,H\) that minimize \(d(G,H)\).
+
+
+
+
+
+
+
+
Important
+
+
+
\(d\) measures the distance between GROUPS.
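A naive sketch of the agglomerative loop above with single linkage, operating directly on a dissimilarity matrix (quadratic in memory, cubic in time; `hclust()` does this properly):

agglomerate_single <- function(D_mat) {
  # D_mat: an n x n matrix of pairwise dissimilarities d_ij, e.g. as.matrix(dist(X))
  groups <- as.list(seq_len(nrow(D_mat))) # start with each point in its own group
  merges <- list()
  while (length(groups) > 1) {
    best <- c(1, 2); best_d <- Inf
    for (a in 1:(length(groups) - 1)) {
      for (b in (a + 1):length(groups)) {
        # single linkage: smallest dissimilarity between a point in G and a point in H
        d_ab <- min(D_mat[groups[[a]], groups[[b]]])
        if (d_ab < best_d) { best_d <- d_ab; best <- c(a, b) }
      }
    }
    merges[[length(merges) + 1]] <- list(members = groups[best], height = best_d)
    groups[[best[1]]] <- c(groups[[best[1]]], groups[[best[2]]]) # merge the two closest groups
    groups[[best[2]]] <- NULL                                    # ... and drop the absorbed one
  }
  merges # the sequence of merges and the linkage distance at each one
}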
+
+
+
+
+
+
Single linkage
+
In single linkage (a.k.a. nearest-neighbor linkage), the linkage distance between \(G,\ H\) is the smallest dissimilarity between two points in different groups: \[d_{\textrm{single}}(G,H) = \min_{i \in G, \, j \in H} d_{ij}\]
+
+
+
+
Complete linkage
+
In complete linkage (i.e. farthest-neighbor linkage), the linkage distance between \(G,H\) is the largest dissimilarity between two points in different clusters: \[d_{\textrm{complete}}(G,H) = \max_{i \in G,\, j \in H} d_{ij}.\]
+
+
+
+
Average linkage
+
In average linkage, the linkage distance between \(G,H\) is the average dissimilarity over all points in different clusters: \[d_{\textrm{average}}(G,H) = \frac{1}{|G| \cdot |H| }\sum_{i \in G, \,j \in H} d_{ij}.\]
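In R, these three linkages correspond to the `method` argument of `hclust`; a sketch with a generic data matrix `X`:

di <- dist(X)                                # the dissimilarities d_ij (Euclidean by default)
hc_single   <- hclust(di, method = "single")
hc_complete <- hclust(di, method = "complete")
hc_average  <- hclust(di, method = "average")
plot(hc_average)                             # the dendrogram
clusters <- cutree(hc_average, k = 3)        # cut it into 3 clusters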
+
+
+
+
Common properties
+
Single, complete, and average linkage share the following:
+
+
They all operate on the dissimilarities \(d_{ij}\).
+
This means that the points we are clustering can be quite general (number of mutations on a genome, polygons, faces, whatever).
+
Running agglomerative clustering with any of these linkages produces a dendrogram with no inversions
+
“No inversions” means that the linkage distance between merged clusters only increases as we run the algorithm.
+
+
In other words, we can draw a proper dendrogram, where the height of a parent is always higher than the height of either daughter.
+
(We’ll return to this again shortly)
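One way to see this in a fitted object: `hclust` stores the merge heights in order, so for these linkages they never decrease (a small sketch, assuming a data matrix `X`):

hc <- hclust(dist(X), method = "average")
all(diff(hc$height) >= 0) # TRUE: no inversions, a proper dendrogram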
+
+
+
Centroid linkage
+
Centroid linkage is relatively new. We need \(x_i \in \mathbb{R}^p\).
+
\(\overline{x}_G\) and \(\overline{x}_H\) are the group averages; the linkage distance is the distance between these two centroids
… very closely related to average linkage (and much, much faster)
+
+
However, it may introduce inversions.
+
+
+
+
+
+
+
+
+
+
+
+Code
+
tt <- seq(0, 2 * pi, len = 50)
tt2 <- seq(0, 2 * pi, len = 75)
c1 <- tibble(x = cos(tt), y = sin(tt))
c2 <- tibble(x = 1.5 * cos(tt2), y = 1.5 * sin(tt2))
circles <- bind_rows(c1, c2)
di <- dist(circles[, 1:2])
hc <- hclust(di, method = "centroid")
par(mar = c(.1, 5, 3, .1))
plot(hc, xlab = "")
+
+
+
+
+
+
+
+
+
+
+
+
Shortcomings of some linkages
+
+
Single
+
+👎 chaining — a single pair of close points merges two clusters. \(\Rightarrow\) clusters can be too spread out, not compact
+
+
Complete linkage
+
+👎 crowding — a point can be closer to points in other clusters than to points in its own cluster. \(\Rightarrow\) clusters are compact, but not far enough apart.
+
+
Average linkage
+
+tries to strike a balance between these two
+
+
+👎 Unclear what properties the resulting clusters have when we cut an average linkage tree.
+
+
+👎 Results change with a monotone increasing transformation of the dissimilarities
+
+
Centroid linkage
+
+👎 same monotonicity problem
+
+
+👎 and inversions
+
+
All linkages
+
+⁇ where do we cut?
+
+
+
+
+
Distances
+
Note how all the methods depend on the distance function
+
We can use lots of dissimilarities besides Euclidean distance
+
This choice is very important
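For example, nothing above requires Euclidean distance; a sketch swapping in other dissimilarities (names here are illustrative):

di_l1 <- dist(X, method = "manhattan") # a different notion of "close"
hc_l1 <- hclust(di_l1, method = "average")
# any symmetric dissimilarity matrix works too, e.g. one based on correlations between rows
hc_cor <- hclust(as.dist(1 - cor(t(X))), method = "average")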
+
+
+
+
Next time…
+
No more slides. All done.
+
+
+
FINAL EXAM!!
+
+
\ No newline at end of file
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-10-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-10-1.svg
new file mode 100644
index 0000000..1823993
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-10-1.svg
@@ -0,0 +1,295 @@
+
+
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-11-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-11-1.svg
new file mode 100644
index 0000000..6601629
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-11-1.svg
@@ -0,0 +1,815 @@
+
+
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-12-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-12-1.svg
new file mode 100644
index 0000000..223c81f
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-12-1.svg
@@ -0,0 +1,295 @@
+
+
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-13-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-13-1.svg
new file mode 100644
index 0000000..223c81f
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-13-1.svg
@@ -0,0 +1,295 @@
+
+
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-2-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-2-1.svg
new file mode 100644
index 0000000..ad16efe
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-2-1.svg
@@ -0,0 +1,21900 @@
+
+
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-3-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-3-1.svg
new file mode 100644
index 0000000..2c41167
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-3-1.svg
@@ -0,0 +1,21985 @@
+
+
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-4-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-4-1.svg
new file mode 100644
index 0000000..ea16fb7
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-4-1.svg
@@ -0,0 +1,223 @@
+
+
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-5-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-5-1.svg
new file mode 100644
index 0000000..18ca4fb
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-5-1.svg
@@ -0,0 +1,254 @@
+
+
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-6-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-6-1.svg
new file mode 100644
index 0000000..bebd629
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-6-1.svg
@@ -0,0 +1,254 @@
+
+
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-7-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-7-1.svg
new file mode 100644
index 0000000..33b7b1a
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-7-1.svg
@@ -0,0 +1,265 @@
+
+
diff --git a/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-8-1.svg b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-8-1.svg
new file mode 100644
index 0000000..e8c88ba
--- /dev/null
+++ b/schedule/slides/28-hclust_files/figure-revealjs/unnamed-chunk-8-1.svg
@@ -0,0 +1,256 @@
+
+
diff --git a/search.json b/search.json
index 85a422d..b25fef2 100644
--- a/search.json
+++ b/search.json
@@ -28,1797 +28,2315 @@
"text": "Let’s just recap THE RIGHT WAY.\n\nFor HW or Labs, always start on main.\nPull in the remote ⬇️ just to be sure everything is up-to-date.\nCreate a branch for your HW/Lab and switch to it. The name doesn’t matter, but it’s good practice to name in something meaningful (rather than something like stat406-lab-1 when you’re doing lab 4).\nOpen the HW/Lab .Rmd and click Knit. Make sure it works.\nDo the work, saving regularly. When you complete a section, Commit the file with a useful message (Push or Not).\nOnce you’re done, make sure that you have done the minimum number of Commits, push ⬆️ your .Rmd and the knitted .pdf.\nOpen a PR on GitHub and respond to the questions.\nMake sure that only the .Rmd and the .pdf for this HW/Lab are there. And Create Pull Request.\nOn your machine, switch the branch to main to prepare for the next HW/Lab.\n\nIf the TA asks for changes, just switch to the branch for this assignment, and make the requested changes. It’s all on your machine (even if the pdf disappears when you switch)."
},
{
- "objectID": "schedule/slides/22-nnets-estimation.html#meta-lecture",
- "href": "schedule/slides/22-nnets-estimation.html#meta-lecture",
+ "objectID": "schedule/slides/27-kmeans.html#meta-lecture",
+ "href": "schedule/slides/27-kmeans.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "22 Neural nets - estimation",
- "text": "22 Neural nets - estimation\nStat 406\nDaniel J. McDonald\nLast modified – 12 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "27 K-means clustering",
+ "text": "27 K-means clustering\nStat 406\nDaniel J. McDonald\nLast modified – 01 November 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\]"
},
{
- "objectID": "schedule/slides/22-nnets-estimation.html#neural-network-terms-again-t-hidden-layers-regression",
- "href": "schedule/slides/22-nnets-estimation.html#neural-network-terms-again-t-hidden-layers-regression",
+ "objectID": "schedule/slides/27-kmeans.html#clustering",
+ "href": "schedule/slides/27-kmeans.html#clustering",
"title": "UBC Stat406 2023W",
- "section": "Neural Network terms again (T hidden layers, regression)",
- "text": "Neural Network terms again (T hidden layers, regression)\n\n\n\\[\n\\begin{aligned}\nA_{k}^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_{\\ell}^{(t)} &= g\\left(\\sum_{k=1}^{K_{t-1}} w^{(t)}_{\\ell,k} A_{k}^{(t-1)} \\right)\\\\\n\\hat{Y} &= z_m = \\sum_{\\ell=1}^{K_T} \\beta_{m,\\ell} A_{\\ell}^{(T)}\\ \\ (M = 1)\n\\end{aligned}\n\\]\n\n\\(B \\in \\R^{M\\times K_T}\\).\n\\(M=1\\) for regression\n\n\\(\\mathbf{W}_t \\in \\R^{K_2\\times K_1}\\) \\(t=1,\\ldots,T\\)"
+ "section": "Clustering",
+ "text": "Clustering\nSo far, we’ve looked at ways of reducing the dimension.\nEither linearly or nonlinearly,\n\nThe goal is visualization/exploration or possibly for an input to supervised learning.\n\nNow we try to find groups or clusters in our data.\nThink of clustering as classification without the labels."
},
{
- "objectID": "schedule/slides/22-nnets-estimation.html#training-neural-networks.-first-choices",
- "href": "schedule/slides/22-nnets-estimation.html#training-neural-networks.-first-choices",
+ "objectID": "schedule/slides/27-kmeans.html#k-means-ideally",
+ "href": "schedule/slides/27-kmeans.html#k-means-ideally",
"title": "UBC Stat406 2023W",
- "section": "Training neural networks. First, choices",
- "text": "Training neural networks. First, choices\n\nChoose the architecture: how many layers, units per layer, what connections?\nChoose the loss: common choices (for each data point \\(i\\))\n\n\nRegression\n\n\\(\\hat{R}_i = \\frac{1}{2}(y_i - \\hat{y}_i)^2\\) (the 1/2 just makes the derivative nice)\n\nClassification\n\n\\(\\hat{R}_i = I(y_i = m)\\log( 1 + \\exp(-z_{im}))\\)\n\n\n\nChoose the activation function \\(g\\)"
+ "section": "K-means (ideally)",
+ "text": "K-means (ideally)\n\nSelect a number of clusters \\(K\\).\nLet \\(C_1,\\ldots,C_K\\) partition \\(\\{1,2,3,\\ldots,n\\}\\) such that\n\nAll observations belong to some set \\(C_k\\).\nNo observation belongs to more than one set.\n\nMake within-cluster variation, \\(W(C_k)\\), as small as possible. \\[\\min_{C_1,\\ldots,C_K} \\sum_{k=1}^K W(C_k).\\]\nDefine \\(W\\) as \\[W(C_k) = \\frac{1}{2|C_k|} \\sum_{i, i' \\in C_k} \\norm{x_i - x_{i'}}_2^2.\\] That is, the average (Euclidean) distance between all cluster members.\n\n\nTo work, K-means needs distance to a center and notion of center"
},
{
- "objectID": "schedule/slides/22-nnets-estimation.html#training-neural-networks-intuition",
- "href": "schedule/slides/22-nnets-estimation.html#training-neural-networks-intuition",
+ "objectID": "schedule/slides/27-kmeans.html#why-this-formula",
+ "href": "schedule/slides/27-kmeans.html#why-this-formula",
"title": "UBC Stat406 2023W",
- "section": "Training neural networks (intuition)",
- "text": "Training neural networks (intuition)\n\nWe need to estimate \\(B\\), \\(\\mathbf{W}_t\\), \\(t=1,\\ldots,T\\)\nWe want to minimize \\(\\hat{R} = \\sum_{i=1}^n \\hat{R}_i\\) as a function of all this.\nWe use gradient descent, but in this dialect, we call it back propagation\n\n\n\nDerivatives via the chain rule: computed by a forward and backward sweep\nAll the \\(g(u)\\)’s that get used have \\(g'(u)\\) “nice”.\nIf \\(g\\) is ReLu:\n\n\\(g(u) = xI(x>0)\\)\n\\(g'(u) = I(x>0)\\)\n\n\n\nOnce we have derivatives from backprop,\n\\[\n\\begin{align}\n\\widetilde{B} &\\leftarrow B - \\gamma \\frac{\\partial \\widehat{R}}{\\partial B}\\\\\n\\widetilde{\\mathbf{W}_t} &\\leftarrow \\mathbf{W}_t - \\gamma \\frac{\\partial \\widehat{R}}{\\partial \\mathbf{W}_t}\n\\end{align}\n\\]"
+ "section": "Why this formula?",
+ "text": "Why this formula?\nLet \\(\\overline{x}_k = \\frac{1}{|C_k|} \\sum_{i\\in C_k} x_i\\)\n\\[\n\\begin{aligned}\n\\sum_{k=1}^K W(C_k)\n&= \\sum_{k=1}^K \\frac{1}{2|C_k|} \\sum_{i, i' \\in C_k} \\norm{x_i - x_{i'}}_2^2\n= \\sum_{k=1}^K \\frac{1}{2|C_k|} \\sum_{i\\neq i' \\in C_k} \\norm{x_i - x_{i'}}_2^2 \\\\\n&= \\sum_{k=1}^K \\frac{1}{2|C_k|} \\sum_{i \\neq i' \\in C_k} \\norm{x_i -\\overline{x}_k + \\overline{x}_k - x_{i'}}_2^2\\\\\n&= \\sum_{k=1}^K \\frac{1}{2|C_k|} \\left[\\sum_{i \\neq i' \\in C_k} \\left(\\norm{x_i - \\overline{x}_k}_2^2 +\n\\norm{x_{i'} - \\overline{x}_k}_2^2\\right) + \\sum_{i \\neq i' \\in C_k} 2 (x_i-\\overline{x}_k)^\\top(\\overline{x}_k - x_{i'})\\right]\\\\\n&= \\sum_{k=1}^K \\frac{1}{2|C_k|} \\left[2(|C_k|-1)\\sum_{i \\in C_k} \\norm{x_i - \\overline{x}_k}_2^2 + 2\\sum_{i \\in C_k} \\norm{x_i - \\overline{x}_k}_2^2 \\right]\\\\\n&= \\sum_{k=1}^K \\sum_{x \\in C_k} \\norm{x-\\overline{x}_k}^2_2\n\\end{aligned}\n\\]\nIf you wanted (equivalently) to minimize \\(\\sum_{k=1}^K \\frac{1}{|C_k|} \\sum_{x \\in C_k} \\norm{x-\\overline{x}_k}^2_2\\), then you’d use \\(\\sum_{k=1}^K \\frac{1}{\\binom{C_k}{2}} \\sum_{i, i' \\in C_k} \\norm{x_i - x_{i'}}_2^2\\)"
},
{
- "objectID": "schedule/slides/22-nnets-estimation.html#chain-rule",
- "href": "schedule/slides/22-nnets-estimation.html#chain-rule",
+ "objectID": "schedule/slides/27-kmeans.html#k-means-in-reality",
+ "href": "schedule/slides/27-kmeans.html#k-means-in-reality",
"title": "UBC Stat406 2023W",
- "section": "Chain rule",
- "text": "Chain rule\nWe want \\(\\frac{\\partial}{\\partial B} \\hat{R}_i\\) and \\(\\frac{\\partial}{\\partial W_{t}}\\hat{R}_i\\) for all \\(t\\).\nRegression: \\(\\hat{R}_i = \\frac{1}{2}(y_i - \\hat{y}_i)^2\\)\n\\[\\begin{aligned}\n\\frac{\\partial\\hat{R}_i}{\\partial B} &= -(y_i - \\hat{y}_i)\\frac{\\partial \\hat{y_i}}{\\partial B} =\\underbrace{-(y_i - \\hat{y}_i)}_{-r_i} \\mathbf{A}^{(T)}\\\\\n\\frac{\\partial}{\\partial \\mathbf{W}_T} \\hat{R}_i &= -(y_i - \\hat{y}_i)\\frac{\\partial\\hat{y_i}}{\\partial \\mathbf{W}_T} = -r_i \\frac{\\partial \\hat{y}_i}{\\partial \\mathbf{A}^{(T)}} \\frac{\\partial \\mathbf{A}^{(T)}}{\\partial \\mathbf{W}_T}\\\\\n&= -\\left(r_i B \\odot g'(\\mathbf{W}_T \\mathbf{A}^{(T)}) \\right) \\left(\\mathbf{A}^{(T-1)}\\right)^\\top\\\\\n\\frac{\\partial}{\\partial \\mathbf{W}_{T-1}} \\hat{R}_i &= -(y_i - \\hat{y}_i)\\frac{\\partial\\hat{y_i}}{\\partial \\mathbf{W}_{T-1}} = -r_i \\frac{\\partial \\hat{y}_i}{\\partial \\mathbf{A}^{(T)}} \\frac{\\partial \\mathbf{A}^{(T)}}{\\partial \\mathbf{W}_{T-1}}\\\\\n&= -r_i \\frac{\\partial \\hat{y}_i}{\\partial \\mathbf{A}^{(T)}} \\frac{\\partial \\mathbf{A}^{(T)}}{\\partial \\mathbf{W}_{T}}\\frac{\\partial \\mathbf{W}_{T}}{\\partial \\mathbf{A}^{(T-1)}}\\frac{\\partial \\mathbf{A}^{(T-1)}}{\\partial \\mathbf{W}_{T-1}}\\\\\n\\cdots &= \\cdots\n\\end{aligned}\\]"
+ "section": "K-means (in reality)",
+ "text": "K-means (in reality)\nThis is too computationally challenging ( \\(K^n\\) partions! ) \\[\\min_{C_1,\\ldots,C_K} \\sum_{k=1}^K W(C_k).\\] So, we make a greedy approximation:\n\nRandomly assign observations to the \\(K\\) clusters\nIterate the following:\n\nFor each cluster, compute the \\(p\\)-length vector of the means in that cluster.\nAssign each observation to the cluster whose centroid is closest (in Euclidean distance).\n\n\nThis procedure is guaranteed to decrease \\(\\sum_{k=1}^K W(C_k)\\) at each step."
},
{
- "objectID": "schedule/slides/22-nnets-estimation.html#mapping-it-out",
- "href": "schedule/slides/22-nnets-estimation.html#mapping-it-out",
+ "objectID": "schedule/slides/27-kmeans.html#best-practices",
+ "href": "schedule/slides/27-kmeans.html#best-practices",
"title": "UBC Stat406 2023W",
- "section": "Mapping it out",
- "text": "Mapping it out\nGiven current \\(\\mathbf{W}_t, B\\), we want to get new, \\(\\widetilde{\\mathbf{W}}_t,\\ \\widetilde B\\) for \\(t=1,\\ldots,T\\)\n\nSquared error for regression, cross-entropy for classification\n\n\n\nFeed forward \n\\[\\mathbf{A}^{(0)} = \\mathbf{X} \\in \\R^{n\\times p}\\]\nRepeat, \\(t= 1,\\ldots, T\\)\n\n\\(\\mathbf{Z}_{t} = \\mathbf{A}^{(t-1)}\\mathbf{W}_t \\in \\R^{n\\times K_t}\\)\n\\(\\mathbf{A}^{(t)} = g(\\mathbf{Z}_{t})\\) (component wise)\n\\(\\dot{\\mathbf{A}}^{(t)} = g'(\\mathbf{Z}_t)\\)\n\n\\[\\begin{cases}\n\\hat{\\mathbf{y}} =\\mathbf{A}^{(T)} B \\in \\R^n \\\\\n\\hat{\\Pi} = \\left(1 + \\exp\\left(-\\mathbf{A}^{(T)}\\mathbf{B}\\right)\\right)^{-1} \\in \\R^{n \\times M}\\end{cases}\\]\n\n\nBack propogate \n\\[r = \\begin{cases}\n-\\left(\\mathbf{y} - \\widehat{\\mathbf{y}}\\right) \\\\\n-\\left(1 - \\widehat{\\Pi}\\right)[y]\\end{cases}\\]\n\\[\n\\begin{aligned}\n\\frac{\\partial}{\\partial \\mathbf{B}} \\widehat{R} &= \\left(\\mathbf{A}^{(T)}\\right)^\\top \\mathbf{r}\\\\\n\\boldsymbol{\\Gamma} &\\leftarrow \\mathbf{r}\\\\\n\\mathbf{W}_{T+1} &\\leftarrow \\mathbf{B}\n\\end{aligned}\n\\]\nRepeat, \\(t = T,...,1\\),\n\n\\(\\boldsymbol{\\Gamma} \\leftarrow \\left(\\boldsymbol{\\Gamma} \\mathbf{W}_{t+1}\\right) \\odot\\dot{\\mathbf{A}}^{(t)}\\)\n\\(\\frac{\\partial R}{\\partial \\mathbf{W}_t} =\\left(\\mathbf{A}^{(t)}\\right)^\\top \\Gamma\\)"
+ "section": "Best practices",
+ "text": "Best practices\nTo fit K-means, you need to\n\nPick \\(K\\) (inherent in the method)\nConvince yourself you have found a good solution (due to the randomized / greedy algorithm).\n\nFor 2., run K-means many times with different starting points. Pick the solution that has the smallest value for \\[\\sum_{k=1}^K W(C_k)\\]\nIt turns out that 1. is difficult to do in a principled way."
},
{
- "objectID": "schedule/slides/22-nnets-estimation.html#deep-nets",
- "href": "schedule/slides/22-nnets-estimation.html#deep-nets",
+ "objectID": "schedule/slides/27-kmeans.html#choosing-the-number-of-clusters",
+ "href": "schedule/slides/27-kmeans.html#choosing-the-number-of-clusters",
"title": "UBC Stat406 2023W",
- "section": "Deep nets",
- "text": "Deep nets\nSome comments on adding layers:\n\nIt has been shown that one hidden layer is sufficient to approximate any bounded piecewise continuous function\nHowever, this may take a huge number of hidden units (i.e. \\(K_1 \\gg 1\\)).\nThis is what people mean when they say that NNets are “universal approximators”\nBy including multiple layers, we can have fewer hidden units per layer.\nAlso, we can encode (in)dependencies that can speed computations\nWe don’t have to connect everything the way we have been"
+ "section": "Choosing the Number of Clusters",
+ "text": "Choosing the Number of Clusters\nWhy is it important?\n\nIt might make a big difference (concluding there are \\(K = 2\\) cancer sub-types versus \\(K = 3\\)).\nOne of the major goals of statistical learning is automatic inference. A good way of choosing \\(K\\) is certainly a part of this."
},
{
- "objectID": "schedule/slides/22-nnets-estimation.html#simple-example",
- "href": "schedule/slides/22-nnets-estimation.html#simple-example",
+ "objectID": "schedule/slides/27-kmeans.html#withinness-and-betweenness",
+ "href": "schedule/slides/27-kmeans.html#withinness-and-betweenness",
"title": "UBC Stat406 2023W",
- "section": "Simple example",
- "text": "Simple example\n\nn <- 200\ndf <- tibble(\n x = seq(.05, 1, length = n),\n y = sin(1 / x) + rnorm(n, 0, .1) # Doppler function\n)\ntestdata <- matrix(seq(.05, 1, length.out = 1e3), ncol = 1)\nlibrary(neuralnet)\nnn_out <- neuralnet(y ~ x, data = df, hidden = c(10, 5, 15), threshold = 0.01, rep = 3)\nnn_preds <- map(1:3, ~ compute(nn_out, testdata, .x)$net.result)\nyhat <- nn_preds |> bind_cols() |> rowMeans() # average over the runs\n\n\n\nCode\n# This code will reproduce the analysis, takes some time\nset.seed(406406406)\nn <- 200\ndf <- tibble(\n x = seq(.05, 1, length = n),\n y = sin(1 / x) + rnorm(n, 0, .1) # Doppler function\n)\ntestx <- matrix(seq(.05, 1, length.out = 1e3), ncol = 1)\nlibrary(neuralnet)\nlibrary(splines)\nfstar <- sin(1 / testx)\nspline_test_err <- function(k) {\n fit <- lm(y ~ bs(x, df = k), data = df)\n yhat <- predict(fit, newdata = tibble(x = testx))\n mean((yhat - fstar)^2)\n}\nKs <- 1:15 * 10\nSplineErr <- map_dbl(Ks, ~ spline_test_err(.x))\n\nJgrid <- c(5, 10, 15)\nNNerr <- double(length(Jgrid)^3)\nNNplot <- character(length(Jgrid)^3)\nsweep <- 0\nfor (J1 in Jgrid) {\n for (J2 in Jgrid) {\n for (J3 in Jgrid) {\n sweep <- sweep + 1\n NNplot[sweep] <- paste(J1, J2, J3, sep = \" \")\n nn_out <- neuralnet(y ~ x, df,\n hidden = c(J1, J2, J3),\n threshold = 0.01, rep = 3\n )\n nn_results <- sapply(1:3, function(x) {\n compute(nn_out, testx, x)$net.result\n })\n # Run them through the neural network\n Yhat <- rowMeans(nn_results)\n NNerr[sweep] <- mean((Yhat - fstar)^2)\n }\n }\n}\n\nbestK <- Ks[which.min(SplineErr)]\nbestspline <- predict(lm(y ~ bs(x, bestK), data = df), newdata = tibble(x = testx))\nbesthidden <- as.numeric(unlist(strsplit(NNplot[which.min(NNerr)], \" \")))\nnn_out <- neuralnet(y ~ x, df, hidden = besthidden, threshold = 0.01, rep = 3)\nnn_results <- sapply(1:3, function(x) compute(nn_out, testdata, x)$net.result)\n# Run them through the neural network\nbestnn <- rowMeans(nn_results)\nplotd <- data.frame(\n x = testdata, spline = bestspline, nnet = bestnn, truth = fstar\n)\nsave.image(file = \"data/nnet-example.Rdata\")"
+ "section": "Withinness and betweenness",
+ "text": "Withinness and betweenness\n\\[W(K) = \\sum_{k=1}^K W(C_k) = \\sum_{k=1}^K \\sum_{x \\in C_k} \\norm{x-\\overline{x}_k}^2_2,\\]\nWithin-cluster variation measures how tightly grouped the clusters are.\n\nIt’s opposite is Between-cluster variation: How spread apart are the clusters?\n\\[B(K) = \\sum_{k=1}^K |C_k| \\norm{\\overline{x}_k - \\overline{x} }_2^2,\\]\nwhere \\(|C_k|\\) is the number of points in \\(C_k\\), and \\(\\overline{x}\\) is the grand mean\n\n\n\\(W\\) when \\(K\\) \n\n\n\\(B\\) when \\(K\\)"
},
{
- "objectID": "schedule/slides/22-nnets-estimation.html#different-architectures",
- "href": "schedule/slides/22-nnets-estimation.html#different-architectures",
+ "objectID": "schedule/slides/27-kmeans.html#ch-index",
+ "href": "schedule/slides/27-kmeans.html#ch-index",
"title": "UBC Stat406 2023W",
- "section": "Different architectures",
- "text": "Different architectures"
+ "section": "CH index",
+ "text": "CH index\n\nWant small \\(W\\) and big \\(B\\)\n\n\\[\\textrm{CH}(K) = \\frac{B(K)/(K-1)}{W(K)/(n-K)}\\]\nTo choose \\(K\\), pick some maximum number of clusters to be considered, \\(K_{\\max} = 20\\), for example\n\\[\\widehat K = \\argmax_{K \\in \\{ 2,\\ldots, K_{\\max} \\}} CH(K).\\]\n\n\n\n\n\n\nNote\n\n\nCH is undefined for \\(K =1\\)"
},
{
- "objectID": "schedule/slides/20-boosting.html#meta-lecture",
- "href": "schedule/slides/20-boosting.html#meta-lecture",
+ "objectID": "schedule/slides/27-kmeans.html#dumb-example",
+ "href": "schedule/slides/27-kmeans.html#dumb-example",
"title": "UBC Stat406 2023W",
- "section": "20 Boosting",
- "text": "20 Boosting\nStat 406\nDaniel J. McDonald\nLast modified – 12 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Dumb example",
+ "text": "Dumb example\n\nlibrary(mvtnorm)\nset.seed(406406406)\nX1 <- rmvnorm(50, c(-1, 2), sigma = matrix(c(1, .5, .5, 1), 2))\nX2 <- rmvnorm(40, c(2, -1), sigma = matrix(c(1.5, .5, .5, 1.5), 2))\nX3 <- rmvnorm(40, c(4, 4))"
},
{
- "objectID": "schedule/slides/20-boosting.html#last-time",
- "href": "schedule/slides/20-boosting.html#last-time",
+ "objectID": "schedule/slides/27-kmeans.html#dumb-example-1",
+ "href": "schedule/slides/27-kmeans.html#dumb-example-1",
"title": "UBC Stat406 2023W",
- "section": "Last time",
- "text": "Last time\nWe learned about bagging, for averaging low-bias / high-variance estimators.\nToday, we examine it’s opposite: Boosting.\nBoosting also combines estimators, but it combines high-bias / low-variance estimators.\nBoosting has a number of flavours. And if you Google descriptions, most are wrong.\nFor a deep (and accurate) treatment, see [ESL] Chapter 10\n\nWe’ll discuss 2 flavours: AdaBoost and Gradient Boosting\nNeither requires a tree, but that’s the typical usage.\nBoosting needs a “weak learner”, so small trees (stumps) are natural."
+ "section": "Dumb example",
+ "text": "Dumb example\n\nWe would maximize CH\n\n\n\nCode\nK <- 2:40\nN <- nrow(clust_raw)\nall_clusters <- map(K, ~ kmeans(clust_raw, .x, nstart = 20))\nall_assignments <- map_dfc(all_clusters, \"cluster\")\nnames(all_assignments) <- paste0(\"K = \", K)\nsummaries <- map_dfr(all_clusters, `[`, c(\"tot.withinss\", \"betweenss\")) %>%\n rename(W = tot.withinss, B = betweenss) %>%\n mutate(K = K, `CH index` = (B / (K - 1)) / (W / (N - K)))\nsummaries %>%\n pivot_longer(-K) %>%\n ggplot(aes(K, value)) +\n geom_line(color = blue, linewidth = 2) +\n ylab(\"\") +\n coord_cartesian(c(1, 20)) +\n facet_wrap(~name, ncol = 3, scales = \"free_y\")"
},
{
- "objectID": "schedule/slides/20-boosting.html#adaboost-intuition-for-classification",
- "href": "schedule/slides/20-boosting.html#adaboost-intuition-for-classification",
+ "objectID": "schedule/slides/27-kmeans.html#dumb-example-2",
+ "href": "schedule/slides/27-kmeans.html#dumb-example-2",
"title": "UBC Stat406 2023W",
- "section": "AdaBoost intuition (for classification)",
- "text": "AdaBoost intuition (for classification)\nAt each iteration, we weight the observations.\nObservations that are currently misclassified, get higher weights.\nSo on the next iteration, we’ll try harder to correctly classify our mistakes.\nThe number of iterations must be chosen."
+ "section": "Dumb example",
+ "text": "Dumb example"
},
{
- "objectID": "schedule/slides/20-boosting.html#adaboost-freund-and-schapire-generic",
- "href": "schedule/slides/20-boosting.html#adaboost-freund-and-schapire-generic",
+ "objectID": "schedule/slides/27-kmeans.html#dumb-example-3",
+ "href": "schedule/slides/27-kmeans.html#dumb-example-3",
"title": "UBC Stat406 2023W",
- "section": "AdaBoost (Freund and Schapire, generic)",
- "text": "AdaBoost (Freund and Schapire, generic)\nLet \\(G(x, \\theta)\\) be any weak learner\n⛭ imagine a tree with one split: then \\(\\theta=\\) (feature, split point)\nAlgorithm (AdaBoost) 🛠️\n\nSet observation weights \\(w_i=1/n\\).\nUntil we quit ( \\(m<M\\) iterations )\n\nEstimate the classifier \\(G(x,\\theta_m)\\) using weights \\(w_i\\)\nCalculate it’s weighted error \\(\\textrm{err}_m = \\sum_{i=1}^n w_i I(y_i \\neq G(x_i, \\theta_m)) / \\sum w_i\\)\nSet \\(\\alpha_m = \\log((1-\\textrm{err}_m)/\\text{err}_m)\\)\nUpdate \\(w_i \\leftarrow w_i \\exp(\\alpha_m I(y_i \\neq G(x_i,\\theta_m)))\\)\n\nFinal classifier is \\(G(x) = \\textrm{sign}\\left( \\sum_{m=1}^M \\alpha_m G(x, \\theta_m)\\right)\\)"
+ "section": "Dumb example",
+ "text": "Dumb example\n\n\\(K = 3\\)\n\n\nkm <- kmeans(clust_raw, 3, nstart = 20)\nnames(km)\n\n[1] \"cluster\" \"centers\" \"totss\" \"withinss\" \"tot.withinss\"\n[6] \"betweenss\" \"size\" \"iter\" \"ifault\" \n\ncenters <- as_tibble(km$centers, .name_repair = \"unique\")"
},
{
- "objectID": "schedule/slides/20-boosting.html#using-mobility-data-again",
- "href": "schedule/slides/20-boosting.html#using-mobility-data-again",
+ "objectID": "schedule/slides/25-pca-issues.html#meta-lecture",
+ "href": "schedule/slides/25-pca-issues.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Using mobility data again",
- "text": "Using mobility data again\n\n\nCode\nlibrary(kableExtra)\nlibrary(randomForest)\nmob <- Stat406::mobility |>\n mutate(mobile = as.factor(Mobility > .1)) |>\n select(-ID, -Name, -Mobility, -State) |>\n drop_na()\nn <- nrow(mob)\ntrainidx <- sample.int(n, floor(n * .75))\ntestidx <- setdiff(1:n, trainidx)\ntrain <- mob[trainidx, ]\ntest <- mob[testidx, ]\nrf <- randomForest(mobile ~ ., data = train)\nbag <- randomForest(mobile ~ ., data = train, mtry = ncol(mob) - 1)\npreds <- tibble(truth = test$mobile, rf = predict(rf, test), bag = predict(bag, test))\n\n\n\n\nlibrary(gbm)\ntrain_boost <- train |>\n mutate(mobile = as.integer(mobile) - 1)\n# needs {0, 1} responses\ntest_boost <- test |>\n mutate(mobile = as.integer(mobile) - 1)\nadab <- gbm(\n mobile ~ .,\n data = train_boost,\n n.trees = 500,\n distribution = \"adaboost\"\n)\npreds$adab <- as.numeric(\n predict(adab, test_boost) > 0\n)\npar(mar = c(5, 11, 0, 1))\ns <- summary(adab, las = 1)"
+ "section": "25 Principal components, the troubles",
+ "text": "25 Principal components, the troubles\nStat 406\nDaniel J. McDonald\nLast modified – 01 November 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\]"
},
{
- "objectID": "schedule/slides/20-boosting.html#forward-stagewise-additive-modeling-fsam-completely-generic",
- "href": "schedule/slides/20-boosting.html#forward-stagewise-additive-modeling-fsam-completely-generic",
+ "objectID": "schedule/slides/25-pca-issues.html#pca",
+ "href": "schedule/slides/25-pca-issues.html#pca",
"title": "UBC Stat406 2023W",
- "section": "Forward stagewise additive modeling (FSAM, completely generic)",
- "text": "Forward stagewise additive modeling (FSAM, completely generic)\nAlgorithm 🛠️\n\nSet initial predictor \\(f_0(x)=0\\)\nUntil we quit ( \\(m<M\\) iterations )\n\nCompute \\((\\beta_m, \\theta_m) = \\argmin_{\\beta, \\theta} \\sum_{i=1}^n L\\left(y_i,\\ f_{m-1}(x_i) + \\beta G(x_i,\\ \\theta)\\right)\\)\nSet \\(f_m(x) = f_{m-1}(x) + \\beta_m G(x,\\ \\theta_m)\\)\n\nFinal classifier is \\(G(x, \\theta_M) = \\textrm{sign}\\left( f_M(x) \\right)\\)\n\nHere, \\(L\\) is a loss function that measures prediction accuracy\n\n\nIf (1) \\(L(y,\\ f(x))= \\exp(-y f(x))\\), (2) \\(G\\) is a classifier, and WLOG \\(y \\in \\{-1, 1\\}\\)\n\nFSAM is equivalent to AdaBoost. Proven 5 years later (Friedman, Hastie, and Tibshirani 2000)."
+ "section": "PCA",
+ "text": "PCA\nIf we knew how to rotate our data, then we could more easily retain the structure.\nPCA gives us exactly this rotation\nPCA works when the data can be represented (in a lower dimension) as lines (or planes, or hyperplanes).\nSo, in two dimensions:"
},
{
- "objectID": "schedule/slides/20-boosting.html#so-what",
- "href": "schedule/slides/20-boosting.html#so-what",
+ "objectID": "schedule/slides/25-pca-issues.html#pca-reduced",
+ "href": "schedule/slides/25-pca-issues.html#pca-reduced",
"title": "UBC Stat406 2023W",
- "section": "So what?",
- "text": "So what?\nIt turns out that “exponential loss” \\(L(y,\\ f(x))= \\exp(-y f(x))\\) is not very robust.\nHere are some other loss functions for 2-class classification\n\n\nWant losses which penalize negative margin, but not positive margins.\nRobust means don’t over-penalize large negatives"
+ "section": "PCA reduced",
+ "text": "PCA reduced\nHere, we can capture a lot of the variation and underlying structure with just 1 dimension,\ninstead of the original 2 (the colouring is for visualizing)."
},
{
- "objectID": "schedule/slides/20-boosting.html#gradient-boosting",
- "href": "schedule/slides/20-boosting.html#gradient-boosting",
+ "objectID": "schedule/slides/25-pca-issues.html#pca-bad",
+ "href": "schedule/slides/25-pca-issues.html#pca-bad",
"title": "UBC Stat406 2023W",
- "section": "Gradient boosting",
- "text": "Gradient boosting\nIn the forward stagewise algorithm, we solved a minimization and then made an update:\n\\[f_m(x) = f_{m-1}(x) + \\beta_m G(x, \\theta_m)\\]\nFor most loss functions \\(L\\) / procedures \\(G\\) this optimization is difficult: \\[\\argmin_{\\beta, \\theta} \\sum_{i=1}^n L\\left(y_i,\\ f_{m-1}(x_i) + \\beta G(x_i, \\theta)\\right)\\]\n💡 Just take one gradient step toward the minimum 💡\n\\[f_m(x) = f_{m-1}(x) -\\gamma_m \\nabla L(y,f_{m-1}(x)) = f_{m-1}(x) +\\gamma_m \\left(-\\nabla L(y,f_{m-1}(x))\\right)\\]\nThis is called Gradient boosting\nNotice how similar the update steps look."
+ "section": "PCA bad",
+ "text": "PCA bad\nWhat about other data structures? Again in two dimensions"
},
{
- "objectID": "schedule/slides/20-boosting.html#gradient-boosting-1",
- "href": "schedule/slides/20-boosting.html#gradient-boosting-1",
+ "objectID": "schedule/slides/25-pca-issues.html#pca-bad-1",
+ "href": "schedule/slides/25-pca-issues.html#pca-bad-1",
"title": "UBC Stat406 2023W",
- "section": "Gradient boosting",
- "text": "Gradient boosting\n\\[f_m(x) = f_{m-1}(x) -\\gamma_m \\nabla L(y,f_{m-1}(x)) = f_{m-1}(x) +\\gamma_m \\left(-\\nabla L(y,f_{m-1}(x))\\right)\\]\nGradient boosting goes only part of the way toward the minimum at each \\(m\\).\nThis has two advantages:\n\nSince we’re not fitting \\(\\beta, \\theta\\) to the data as “hard”, the learner is weaker.\nThis procedure is computationally much simpler.\n\nSimpler because we only require the gradient at one value, don’t have to fully optimize."
+ "section": "PCA bad",
+ "text": "PCA bad\nHere, we have failed miserably.\nThere is actually only 1 dimension to this data (imagine walking up the spiral going from purple to yellow).\nHowever, when we write it as 1 PCA dimension, all the points are all “mixed up”."
},
{
- "objectID": "schedule/slides/20-boosting.html#gradient-boosting-algorithm",
- "href": "schedule/slides/20-boosting.html#gradient-boosting-algorithm",
+ "objectID": "schedule/slides/25-pca-issues.html#explanation",
+ "href": "schedule/slides/25-pca-issues.html#explanation",
"title": "UBC Stat406 2023W",
- "section": "Gradient boosting – Algorithm 🛠️",
- "text": "Gradient boosting – Algorithm 🛠️\n\nSet initial predictor \\(f_0(x)=\\overline{\\y}\\)\nUntil we quit ( \\(m<M\\) iterations )\n\nCompute pseudo-residuals (what is the gradient of \\(L(y,f)=(y-f(x))^2\\)?) \\[r_i = -\\frac{\\partial L(y_i,f(x_i))}{\\partial f(x_i)}\\bigg|_{f(x_i)=f_{m-1}(x_i)}\\]\nEstimate weak learner, \\(G(x, \\theta_m)\\), with the training set \\(\\{r_i, x_i\\}\\).\nFind the step size \\(\\gamma_m = \\argmin_\\gamma \\sum_{i=1}^n L(y_i, f_{m-1}(x_i) + \\gamma G(x_i, \\theta_m))\\)\nSet \\(f_m(x) = f_{m-1}(x) + \\gamma_m G(x, \\theta_m)\\)\n\nFinal predictor is \\(f_M(x)\\)."
+ "section": "Explanation",
+ "text": "Explanation\n\n\nPCA wants to minimize distances (equivalently maximize variance).\nThis means it slices through the data at the meatiest point, and then the next one, and so on.\nIf the data are curved this is going to induce artifacts.\nPCA also looks at things as being close if they are near each other in a Euclidean sense.\nOn the spiral, our intuition says that things are close only if the distance is constrained to go along the curve.\nIn other words, purple and blue are close, blue and yellow are not."
},
{
- "objectID": "schedule/slides/20-boosting.html#gradient-boosting-modifications",
- "href": "schedule/slides/20-boosting.html#gradient-boosting-modifications",
+ "objectID": "schedule/slides/25-pca-issues.html#kernel-pca",
+ "href": "schedule/slides/25-pca-issues.html#kernel-pca",
"title": "UBC Stat406 2023W",
- "section": "Gradient boosting modifications",
- "text": "Gradient boosting modifications\n\ngrad_boost <- gbm(mobile ~ ., data = train_boost, n.trees = 500, distribution = \"bernoulli\")\n\n\nTypically done with “small” trees, not stumps because of the gradient. You can specify the size. Usually 4-8 terminal nodes is recommended (more gives more interactions between predictors)\nUsually modify the gradient step to \\(f_m(x) = f_{m-1}(x) + \\gamma_m \\alpha G(x,\\theta_m)\\) with \\(0<\\alpha<1\\). Helps to keep from fitting too hard.\nOften combined with Bagging so that each step is fit using a bootstrap resample of the data. Gives us out-of-bag options.\nThere are many other extensions, notably XGBoost."
+ "section": "Kernel PCA",
+ "text": "Kernel PCA\nClassical PCA comes from \\(\\X= \\U\\D\\V^{\\top}\\), the SVD of the (centered) data\nHowever, we can just as easily get it from the outer product \\(\\mathbf{K} = \\X\\X^{\\top} = \\U\\D^2\\U^{\\top}\\)\nThe intuition behind KPCA is that \\(\\mathbf{K}\\) is an expansion into a kernel space, where \\[\\mathbf{K}_{i,i'} = k(x_i,\\ x_{i'}) = \\langle x_i,x_{i'} \\rangle\\]\nWe saw this trick before with feature expansion."
},
{
- "objectID": "schedule/slides/20-boosting.html#results-for-mobility",
- "href": "schedule/slides/20-boosting.html#results-for-mobility",
+ "objectID": "schedule/slides/25-pca-issues.html#procedure",
+ "href": "schedule/slides/25-pca-issues.html#procedure",
"title": "UBC Stat406 2023W",
- "section": "Results for mobility",
- "text": "Results for mobility\n\n\nCode\nlibrary(cowplot)\nboost_preds <- tibble(\n adaboost = predict(adab, test_boost),\n gbm = predict(grad_boost, test_boost),\n truth = test$mobile\n)\ng1 <- ggplot(boost_preds, aes(adaboost, gbm, color = as.factor(truth))) +\n geom_text(aes(label = as.integer(truth) - 1)) +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0) +\n xlab(\"adaboost margin\") +\n ylab(\"gbm margin\") +\n theme(legend.position = \"none\") +\n scale_color_manual(values = c(\"orange\", \"blue\")) +\n annotate(\"text\",\n x = -4, y = 5, color = red,\n label = paste(\n \"gbm error\\n\",\n round(with(boost_preds, mean((gbm > 0) != truth)), 2)\n )\n ) +\n annotate(\"text\",\n x = 4, y = -5, color = red,\n label = paste(\"adaboost error\\n\", round(with(boost_preds, mean((adaboost > 0) != truth)), 2))\n )\nboost_oob <- tibble(\n adaboost = adab$oobag.improve, gbm = grad_boost$oobag.improve,\n ntrees = 1:500\n)\ng2 <- boost_oob %>%\n pivot_longer(-ntrees, values_to = \"OOB_Error\") %>%\n ggplot(aes(x = ntrees, y = OOB_Error, color = name)) +\n geom_line() +\n scale_color_manual(values = c(orange, blue)) +\n theme(legend.title = element_blank())\nplot_grid(g1, g2, rel_widths = c(.4, .6))"
+ "section": "Procedure",
+ "text": "Procedure\n\nSpecify a kernel function \\(k\\)\nmany people use \\(k(x,x') = \\exp\\left( -d(x, x')/\\gamma\\right)\\) where \\(d(x,x') = \\norm{x-x'}_2^2\\)\nForm \\(\\mathbf{K}_{i,i'} = k(x_i,x_{i'})\\)\nDouble center \\(\\mathbf{K} = \\mathbf{PKP}\\) where \\(\\mathbf{P} = \\mathbf{I}_n - \\mathbf{11}^\\top / n\\)\nTake eigendecomposition \\(\\mathbf{K} = \\U\\D^2\\U^{\\top}\\)\n\nThe scores are still \\(\\mathbf{Z} = \\U_M\\D_M\\)\n\n\n\n\n\n\nNote\n\n\nWe don’t explicitly generate the feature map \\(\\longrightarrow\\) there are NO loadings"
},
{
- "objectID": "schedule/slides/20-boosting.html#major-takeaways",
- "href": "schedule/slides/20-boosting.html#major-takeaways",
+ "objectID": "schedule/slides/25-pca-issues.html#an-alternate-view",
+ "href": "schedule/slides/25-pca-issues.html#an-alternate-view",
"title": "UBC Stat406 2023W",
- "section": "Major takeaways",
- "text": "Major takeaways\n\nTwo flavours of Boosting\n\nAdaBoost (the original) and\ngradient boosting (easier and more computationally friendly)\n\nThe connection is “Forward stagewise additive modelling” (AdaBoost is a special case)\nThe connection reveals that AdaBoost “isn’t robust because it uses exponential loss” (squared error is even worse)\nGradient boosting is a computationally easier version of FSAM\nAll use weak learners (compare to Bagging)\nThink about the Bias-Variance implications\nYou can use these for regression or classification\nYou can do this with other weak learners besides trees."
+ "section": "An alternate view",
+ "text": "An alternate view\nTo get the first PC in classical PCA, we solve \\[\\max_\\alpha \\Var{\\X\\alpha} \\quad \\textrm{ subject to } \\quad \\left|\\left| \\alpha \\right|\\right|_2^2 = 1\\]\nIn the kernel setting we solve \\[\\max_{g \\in \\mathcal{H}_k} \\Var{g(X)} \\quad \\textrm{ subject to } \\quad\\left|\\left| g \\right|\\right|_{\\mathcal{H}_k} = 1\\]\nHere \\(\\mathcal{H}_k\\) is a function space determined by \\(k(x,x')\\)."
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#meta-lecture",
- "href": "schedule/slides/18-the-bootstrap.html#meta-lecture",
+ "objectID": "schedule/slides/25-pca-issues.html#example-kernels",
+ "href": "schedule/slides/25-pca-issues.html#example-kernels",
"title": "UBC Stat406 2023W",
- "section": "18 The bootstrap",
- "text": "18 The bootstrap\nStat 406\nDaniel J. McDonald\nLast modified – 30 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Example kernels",
+ "text": "Example kernels\n\n\\(k(x,x') = x^\\top x'\\)\n\ngives back regular PCA\n\n\\(k(x,x') = (1+x^\\top x')^d\\)\n\ngives a function space which contains all \\(d^{th}\\)-order\n\n\npolynomials.\n\n\\(k(x,x') = \\exp(-\\norm{x-x'}_2^2/\\gamma)\\)\n\ngives a function space spanned by the infinite Fourier basis\n\n\n\nFor more details see [ESL 14.5]"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#a-small-detour",
- "href": "schedule/slides/18-the-bootstrap.html#a-small-detour",
+ "objectID": "schedule/slides/25-pca-issues.html#kernel-pca-on-the-spiral",
+ "href": "schedule/slides/25-pca-issues.html#kernel-pca-on-the-spiral",
"title": "UBC Stat406 2023W",
- "section": "A small detour…",
- "text": "A small detour…"
+ "section": "Kernel PCA on the spiral",
+ "text": "Kernel PCA on the spiral\n\n\nCode\nn <- nrow(df_spiral)\nI_M <- (diag(n) - tcrossprod(rep(1, n)) / n)\nkp <- (tcrossprod(as.matrix(df_spiral[, 1:2])) + 1)^2\nKp <- I_M %*% kp %*% I_M\nEp <- eigen(Kp, symmetric = TRUE)\npolydf <- tibble(\n x = Ep$vectors[, 1] * Ep$values[1],\n y = jit,\n z = df_spiral$z\n)\nkg <- exp(-as.matrix(dist(df_spiral[, 1:2]))^2 / 1)\nKg <- I_M %*% kg %*% I_M\nEg <- eigen(Kg, symmetric = TRUE)\ngaussdf <- tibble(\n x = Eg$vectors[, 1] * Eg$values[1],\n y = jit,\n z = df_spiral$z\n)\ndfkern <- bind_rows(df_spiral, df_spiral2, polydf, gaussdf)\ndfkern$method <- rep(c(\"data\", \"pca\", \"kpoly (d = 2)\", \"kgauss (gamma = 1)\"), each = n)"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#in-statistics",
- "href": "schedule/slides/18-the-bootstrap.html#in-statistics",
+ "objectID": "schedule/slides/25-pca-issues.html#kpca-summary",
+ "href": "schedule/slides/25-pca-issues.html#kpca-summary",
"title": "UBC Stat406 2023W",
- "section": "In statistics…",
- "text": "In statistics…\nThe “bootstrap” works. And well.\nIt’s good for “second-level” analysis.\n\n“First-level” analyses are things like \\(\\hat\\beta\\), \\(\\hat \\y\\), an estimator of the center (a median), etc.\n“Second-level” are things like \\(\\Var{\\hat\\beta}\\), a confidence interval for \\(\\hat \\y\\), or a median, etc.\n\n\nYou usually get these “second-level” properties from “the sampling distribution of an estimator”"
+ "section": "KPCA: summary",
+ "text": "KPCA: summary\nKernel PCA seeks to generalize the notion of similarity using a kernel map\nThis can be interpreted as finding smooth, orthogonal directions in an RKHS\nThis can allow us to start picking up nonlinear (in the original feature space) aspects of our data\nThis new representation can be passed to a supervised method to form a semisupervised learner\nThis kernel is different than kernel smoothing!!\n\nJust like with PCA (and lots of other things) the way you measure distance is important\nThe choice of Kernel is important\nThe embedding dimension must be chosen"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#in-statistics-1",
- "href": "schedule/slides/18-the-bootstrap.html#in-statistics-1",
+ "objectID": "schedule/slides/25-pca-issues.html#basic-semi-supervised-see-islr-6.3.1",
+ "href": "schedule/slides/25-pca-issues.html#basic-semi-supervised-see-islr-6.3.1",
"title": "UBC Stat406 2023W",
- "section": "In statistics…",
- "text": "In statistics…\nThe “bootstrap” works. And well.\nIt’s good for “second-level” analysis.\n\n“First-level” analyses are things like \\(\\hat\\beta\\), \\(\\hat \\y\\), an estimator of the center (a median), etc.\n“Second-level” are things like \\(\\Var{\\hat\\beta}\\), a confidence interval for \\(\\hat \\y\\), or a median, etc.\n\n\nBut what if you don’t know the sampling distribution? Or you’re skeptical of the CLT argument?"
+ "section": "Basic semi-supervised (see [ISLR 6.3.1])",
+ "text": "Basic semi-supervised (see [ISLR 6.3.1])\n\nYou get data \\(\\{(x_1,y_1),\\ldots,(x_n,y_n)\\}\\).\nYou do something unsupervised on \\(\\X\\) to create new features (like PCA).\nYou use the learned features to find a predictor \\(\\hat{f}\\) (say, do OLS on them)"
},
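A hedged sketch of this three-step recipe with simulated data (the sizes and names are my own): PCA on the feature matrix for step 2, then OLS on the scores for step 3.

```r
set.seed(406)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)        # toy supervised data

M <- 3                                     # embedding dimension (a tuning parameter)
Z <- prcomp(X, scale. = TRUE)$x[, 1:M]     # step 2: unsupervised features (PCA scores)
fhat <- lm(y ~ Z)                          # step 3: supervised step on the learned features
summary(fhat)$r.squared
```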
{
- "objectID": "schedule/slides/18-the-bootstrap.html#in-statistics-2",
- "href": "schedule/slides/18-the-bootstrap.html#in-statistics-2",
+ "objectID": "schedule/slides/23-nnets-other.html#meta-lecture",
+ "href": "schedule/slides/23-nnets-other.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "In statistics…",
- "text": "In statistics…\nThe “bootstrap” works. And well.\nIt’s good for “second-level” analysis.\n\n“First-level” analyses are things like \\(\\hat\\beta\\), \\(\\hat \\y\\), an estimator of the center (a median), etc.\n“Second-level” are things like \\(\\Var{\\hat\\beta}\\), a confidence interval for \\(\\hat \\y\\), or a median, etc.\n\n\nSampling distributions\n\nIf \\(X_i\\) are iid Normal \\((0,\\sigma^2)\\), then \\(\\Var{\\overline{X}} = \\sigma^2 / n\\).\nIf \\(X_i\\) are iid and \\(n\\) is big, then \\(\\Var{\\overline{X}} \\approx \\Var{X_1} / n\\).\nIf \\(X_i\\) are iid Binomial \\((m, p)\\), then \\(\\Var{\\overline{X}} = mp(1-p) / n\\)"
+ "section": "23 Neural nets - other considerations",
+ "text": "23 Neural nets - other considerations\nStat 406\nDaniel J. McDonald\nLast modified – 12 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#example-of-unknown-sampling-distribution",
- "href": "schedule/slides/18-the-bootstrap.html#example-of-unknown-sampling-distribution",
+ "objectID": "schedule/slides/23-nnets-other.html#estimation-procedures-training",
+ "href": "schedule/slides/23-nnets-other.html#estimation-procedures-training",
"title": "UBC Stat406 2023W",
- "section": "Example of unknown sampling distribution",
- "text": "Example of unknown sampling distribution\nI estimate a LDA on some data.\nI get a new \\(x_0\\) and produce \\(\\widehat{Pr}(y_0 =1 \\given x_0)\\).\nCan I get a 95% confidence interval for \\(Pr(y_0=1 \\given x_0)\\)?\n\nThe bootstrap gives this to you."
+ "section": "Estimation procedures (training)",
+ "text": "Estimation procedures (training)\nBack-propagation\nAdvantages:\n\nIt’s updates only depend on local information in the sense that if objects in the hierarchical model are unrelated to each other, the updates aren’t affected\n(This helps in many ways, most notably in parallel architectures)\nIt doesn’t require second-derivative information\nAs the updates are only in terms of \\(\\hat{R}_i\\), the algorithm can be run in either batch or online mode\n\nDown sides:\n\nIt can be very slow\nNeed to choose the learning rate \\(\\gamma_t\\)"
},
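To make the batch vs. online distinction concrete, here is a toy gradient-descent sketch of my own (plain least squares with one coefficient, not a real back-propagation implementation), showing the role of the learning rate and the two update modes.

```r
set.seed(406)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)

gamma <- 0.1                     # learning rate (held fixed here for simplicity)
beta_batch <- beta_online <- 0
for (t in 1:100) {
  # batch mode: the gradient of the average loss uses all n observations
  beta_batch <- beta_batch - gamma * mean(-2 * x * (y - x * beta_batch))
  # online mode: each update uses a single, randomly chosen observation
  i <- sample(n, 1)
  beta_online <- beta_online - gamma * (-2 * x[i] * (y[i] - x[i] * beta_online))
}
c(batch = beta_batch, online = beta_online, ols = coef(lm(y ~ x + 0)))
```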
{
- "objectID": "schedule/slides/18-the-bootstrap.html#bootstrap-procedure",
- "href": "schedule/slides/18-the-bootstrap.html#bootstrap-procedure",
+ "objectID": "schedule/slides/23-nnets-other.html#other-algorithms",
+ "href": "schedule/slides/23-nnets-other.html#other-algorithms",
"title": "UBC Stat406 2023W",
- "section": "Bootstrap procedure",
- "text": "Bootstrap procedure\n\nResample your training data w/ replacement.\nCalculate LDA on this sample.\nProduce a new prediction, call it \\(\\widehat{Pr}_b(y_0 =1 \\given x_0)\\).\nRepeat 1-3 \\(b = 1,\\ldots,B\\) times.\nCI: \\(\\left[2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(1-\\alpha/2),\\ 2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(\\alpha/2)\\right]\\)\n\n\n\\(\\hat{F}\\) is the “empirical” distribution of the bootstraps."
+ "section": "Other algorithms",
+ "text": "Other algorithms\nThere are many variations on the fitting algorithm\nStochastic gradient descent: (SGD) discussed in the optimization lecture\nThe rest are variations that use lots of tricks\n\nRMSprop\nAdam\nAdadelta\nAdagrad\nAdamax\nNadam\nFtrl"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#empirical-distribution",
- "href": "schedule/slides/18-the-bootstrap.html#empirical-distribution",
+ "objectID": "schedule/slides/23-nnets-other.html#regularizing-neural-networks",
+ "href": "schedule/slides/23-nnets-other.html#regularizing-neural-networks",
"title": "UBC Stat406 2023W",
- "section": "Empirical distribution",
- "text": "Empirical distribution\n\n\nCode\nr <- rexp(50, 1 / 5)\nggplot(tibble(r = r), aes(r)) + \n stat_ecdf(colour = orange) +\n geom_vline(xintercept = quantile(r, probs = c(.05, .95))) +\n geom_hline(yintercept = c(.05, .95), linetype = \"dashed\") +\n annotate(\n \"label\", x = c(5, 12), y = c(.25, .75), \n label = c(\"hat(F)[boot](.05)\", \"hat(F)[boot](.95)\"), \n parse = TRUE\n )"
+ "section": "Regularizing neural networks",
+ "text": "Regularizing neural networks\nNNets can almost always achieve 0 training error. Even with regularization. Because they have so many parameters.\nFlavours:\n\na complexity penalization term \\(\\longrightarrow\\) solve \\(\\min \\hat{R} + \\rho(\\alpha,\\beta)\\)\nearly stopping on the back propagation algorithm used for fitting\n\n\nWeight decay\n\nThis is like ridge regression in that we penalize the squared Euclidean norm of the weights \\(\\rho(\\mathbf{W},\\mathbf{B}) = \\sum w_i^2 + \\sum b_i^2\\)\n\nWeight elimination\n\nThis encourages more shrinking of small weights \\(\\rho(\\mathbf{W},\\mathbf{B}) = \\sum \\frac{w_i^2}{1+w_i^2} + \\sum \\frac{b_i^2}{1 + b_i^2}\\) or Lasso-type\n\nDropout\n\nIn each epoch, randomly choose \\(z\\%\\) of the nodes and set those weights to zero."
},
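For concreteness, a tiny sketch (mine, not from the slides) comparing the weight-decay and weight-elimination penalties on the same set of weights.

```r
weight_decay <- function(w, b) sum(w^2) + sum(b^2)
weight_elimination <- function(w, b) sum(w^2 / (1 + w^2)) + sum(b^2 / (1 + b^2))

w <- c(-3, -0.1, 0.1, 2)    # made-up weights
b <- c(0.5, -0.2)           # made-up biases
c(decay = weight_decay(w, b), elimination = weight_elimination(w, b))
# Each weight-elimination term is bounded by 1, so large weights are penalized
# far less than under weight decay, while small weights are shrunk relatively more.
```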
{
- "objectID": "schedule/slides/18-the-bootstrap.html#very-basic-example",
- "href": "schedule/slides/18-the-bootstrap.html#very-basic-example",
+ "objectID": "schedule/slides/23-nnets-other.html#other-common-pitfalls",
+ "href": "schedule/slides/23-nnets-other.html#other-common-pitfalls",
"title": "UBC Stat406 2023W",
- "section": "Very basic example",
- "text": "Very basic example\n\nLet \\(X_i\\sim \\textrm{Exponential}(1/5)\\). The pdf is \\(f(x) = \\frac{1}{5}e^{-x/5}\\)\nI know if I estimate the mean with \\(\\bar{X}\\), then by the CLT (if \\(n\\) is big),\n\n\\[\\frac{\\sqrt{n}(\\bar{X}-E[X])}{s} \\approx N(0, 1).\\]\n\nThis gives me a 95% confidence interval like \\[\\bar{X} \\pm 2s/\\sqrt{n}\\]\nBut I don’t want to estimate the mean, I want to estimate the median."
+ "section": "Other common pitfalls",
+ "text": "Other common pitfalls\nThere are a few areas to watch out for\nNonconvexity:\nThe neural network optimization problem is non-convex.\nThis makes any numerical solution highly dependent on the initial values. These should be\n\nchosen carefully, typically random near 0. DON’T use all 0.\nregenerated several times to check sensitivity\n\nScaling:\nBe sure to standardize the covariates before training"
},
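A short illustration of those two precautions (my own, not slide code): standardize the covariates, and draw starting weights randomly near 0 rather than setting them all to 0.

```r
set.seed(406)
X <- cbind(height_m = rnorm(50, 1.7, 0.1),      # covariates on wildly different scales
           income   = rnorm(50, 5e4, 1e4))
Xs <- scale(X)                                  # each column: mean 0, sd 1
round(colMeans(Xs), 10); apply(Xs, 2, sd)

K <- 5; p <- ncol(Xs)
W_init <- matrix(rnorm(K * p, sd = 0.1), K, p)  # small random initial weights, not all 0
```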
{
- "objectID": "schedule/slides/18-the-bootstrap.html#section-1",
- "href": "schedule/slides/18-the-bootstrap.html#section-1",
+ "objectID": "schedule/slides/23-nnets-other.html#other-common-pitfalls-1",
+ "href": "schedule/slides/23-nnets-other.html#other-common-pitfalls-1",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "Code\nggplot(data.frame(x = c(0, 12)), aes(x)) +\n stat_function(fun = function(x) dexp(x, 1 / 5), color = orange) +\n geom_vline(xintercept = 5, color = blue) + # mean\n geom_vline(xintercept = qexp(.5, 1 / 5), color = red) + # median\n annotate(\"label\",\n x = c(2.5, 5.5, 10), y = c(.15, .15, .05),\n label = c(\"median\", \"bar(x)\", \"pdf\"), parse = TRUE,\n color = c(red, blue, orange), size = 6\n )"
+ "section": "Other common pitfalls",
+ "text": "Other common pitfalls\nNumber of hidden units:\nIt is generally better to have too many hidden units than too few (regularization can eliminate some).\nSifting the output:\n\nChoose the solution that minimizes training error\nChoose the solution that minimizes the penalized training error\nAverage the solutions across runs"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#now-what",
- "href": "schedule/slides/18-the-bootstrap.html#now-what",
+ "objectID": "schedule/slides/23-nnets-other.html#tuning-parameters",
+ "href": "schedule/slides/23-nnets-other.html#tuning-parameters",
"title": "UBC Stat406 2023W",
- "section": "Now what…",
- "text": "Now what…\n\nI give you a sample of size 500, you give me the sample median.\nHow do you get a CI?\nYou can use the bootstrap!\n\n\nset.seed(406406406)\nx <- rexp(n, 1 / 5)\n(med <- median(x)) # sample median\n\n[1] 3.611615\n\nB <- 100\nalpha <- 0.05\nFhat <- map_dbl(1:B, ~ median(sample(x, replace = TRUE))) # repeat B times, \"empirical distribution\"\nCI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))"
+ "section": "Tuning parameters",
+ "text": "Tuning parameters\nThere are many.\n\nRegularization\nStopping criterion\nlearning rate\nArchitecture\nDropout %\nothers…\n\nThese are hard to tune.\nIn practice, people might choose “some” with a validation set, and fix the rest largely arbitrarily\n\nMore often, people set them all arbitrarily"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#section-2",
- "href": "schedule/slides/18-the-bootstrap.html#section-2",
+ "objectID": "schedule/slides/23-nnets-other.html#thoughts-on-nnets",
+ "href": "schedule/slides/23-nnets-other.html#thoughts-on-nnets",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "Code\nggplot(data.frame(Fhat), aes(Fhat)) +\n geom_density(color = orange) +\n geom_vline(xintercept = CI, color = orange, linetype = 2) +\n geom_vline(xintercept = med, col = blue) +\n geom_vline(xintercept = qexp(.5, 1 / 5), col = red) +\n annotate(\"label\",\n x = c(3.15, 3.5, 3.75), y = c(.5, .5, 1),\n color = c(orange, red, blue),\n label = c(\"widehat(F)\", \"true~median\", \"widehat(median)\"),\n parse = TRUE\n ) +\n xlab(\"x\") +\n geom_rug(aes(2 * med - Fhat))"
+ "section": "Thoughts on NNets",
+ "text": "Thoughts on NNets\nOff the top of my head, without lots of justification\n\n\n🤬😡 Why don’t statisticians like them? 🤬😡\n\nThere is little theory (though this is increasing)\nStat theory applies to global minima, here, only local determined by the optimizer\nLittle understanding of when they work\nIn large part, NNets look like logistic regression + feature creation. We understand that well, and in many applications, it performs as well\nExplosion of tuning parameters without a way to decide\nRequire massive datasets to work\nLots of examples where they perform exceedingly poorly\n\n\n\n🔥🔥Why are they hot?🔥🔥\n\nPerform exceptionally well on typical CS tasks (images, translation)\nTake advantage of SOTA computing (parallel, GPUs)\nVery good for multinomial logistic regression\nAn excellent example of “transfer learning”\nThey generate pretty pictures (the nets, pseudo-responses at hidden units)"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#how-does-this-work",
- "href": "schedule/slides/18-the-bootstrap.html#how-does-this-work",
+ "objectID": "schedule/slides/23-nnets-other.html#keras",
+ "href": "schedule/slides/23-nnets-other.html#keras",
"title": "UBC Stat406 2023W",
- "section": "How does this work?",
- "text": "How does this work?"
+ "section": "Keras",
+ "text": "Keras\nMost people who do deep learning use Python \\(+\\) Keras \\(+\\) Tensorflow\nIt takes some work to get all this software up and running.\nIt is possible to do in with R using an interface to Keras.\n\nI used to try to do a walk-through, but the interface is quite brittle\nIf you want to explore, see the handout:\n\nKnitted: https://ubc-stat.github.io/stat-406-lectures/handouts/keras-nnet.html\nRmd: https://ubc-stat.github.io/stat-406-lectures/handouts/keras-nnet.Rmd"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#approximations",
- "href": "schedule/slides/18-the-bootstrap.html#approximations",
+ "objectID": "schedule/slides/23-nnets-other.html#section",
+ "href": "schedule/slides/23-nnets-other.html#section",
"title": "UBC Stat406 2023W",
- "section": "Approximations",
- "text": "Approximations"
+ "section": "",
+ "text": "The Bias-Variance Trade-Off & \"DOUBLE DESCENT\" 🧵Remember the bias-variance trade-off? It says that models perform well for an \"intermediate level of flexibility\". You've seen the picture of the U-shape test error curve.We try to hit the \"sweet spot\" of flexibility.1/🧵 pic.twitter.com/HPk05izkZh— Daniela Witten (@daniela_witten) August 9, 2020"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#slightly-harder-example",
- "href": "schedule/slides/18-the-bootstrap.html#slightly-harder-example",
+ "objectID": "schedule/slides/23-nnets-other.html#where-does-this-u-shape-come-from",
+ "href": "schedule/slides/23-nnets-other.html#where-does-this-u-shape-come-from",
"title": "UBC Stat406 2023W",
- "section": "Slightly harder example",
- "text": "Slightly harder example\n\n\n\nggplot(fatcats, aes(Bwt, Hwt)) +\n geom_point(color = blue) +\n xlab(\"Cat body weight (Kg)\") +\n ylab(\"Cat heart weight (g)\")\n\n\n\n\n\n\n\n\n\n\n\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nsummary(cats.lm)\n\n\nCall:\nlm(formula = Hwt ~ 0 + Bwt, data = fatcats)\n\nResiduals:\n Min 1Q Median 3Q Max \n-11.2353 -0.7932 -0.1407 0.5968 11.1026 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nBwt 3.95424 0.06294 62.83 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.089 on 143 degrees of freedom\nMultiple R-squared: 0.965, Adjusted R-squared: 0.9648 \nF-statistic: 3947 on 1 and 143 DF, p-value: < 2.2e-16\n\nconfint(cats.lm)\n\n 2.5 % 97.5 %\nBwt 3.829836 4.078652"
+ "section": "Where does this U shape come from?",
+ "text": "Where does this U shape come from?\nMSE = Squared Bias + Variance + Irreducible Noise\nAs we increase flexibility:\n\nSquared bias goes down\nVariance goes up\nEventually, | \\(\\partial\\) Variance | \\(>\\) | \\(\\partial\\) Squared Bias |.\n\nGoal: Choose amount of flexibility to balance these and minimize MSE.\n\nUse CV or something to estimate MSE and decide how much flexibility."
},
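The last point can be done in a few lines; here is a rough sketch under my own assumptions (simulated data, polynomial flexibility chosen by 5-fold CV; not the slides' code).

```r
set.seed(406)
n <- 100
df <- data.frame(x = runif(n, -3, 3))
df$y <- sin(2 * df$x) + rnorm(n, sd = 0.3)
folds <- sample(rep(1:5, length.out = n))

degrees <- 1:12                                  # the "flexibility" knob
cv_mse <- sapply(degrees, function(deg) {
  mean(sapply(1:5, function(k) {
    fit <- lm(y ~ poly(x, deg), data = df[folds != k, ])
    mean((df$y[folds == k] - predict(fit, newdata = df[folds == k, ]))^2)
  }))
})
degrees[which.min(cv_mse)]                       # estimated sweet spot of flexibility
```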
{
- "objectID": "schedule/slides/18-the-bootstrap.html#when-we-fit-models-we-examine-diagnostics",
- "href": "schedule/slides/18-the-bootstrap.html#when-we-fit-models-we-examine-diagnostics",
+ "objectID": "schedule/slides/23-nnets-other.html#section-1",
+ "href": "schedule/slides/23-nnets-other.html#section-1",
"title": "UBC Stat406 2023W",
- "section": "When we fit models, we examine diagnostics",
- "text": "When we fit models, we examine diagnostics\n\n\n\nqqnorm(residuals(cats.lm), pch = 16, col = blue)\nqqline(residuals(cats.lm), col = orange, lwd = 2)\n\n\n\n\n\n\n\n\nThe tails are too fat. So I don’t believe that CI…\n\n\nWe bootstrap\n\nB <- 500\nalpha <- .05\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |>\n slice_sample(prop = 1, replace = TRUE)\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n\n 97.5% 2.5% \n3.826735 4.084322 \n\nconfint(cats.lm) # Original CI\n\n 2.5 % 97.5 %\nBwt 3.829836 4.078652"
+ "section": "",
+ "text": "In the past few yrs, (and particularly in the context of deep learning) ppl have noticed \"double descent\" -- when you continue to fit increasingly flexible models that interpolate the training data, then the test error can start to DECREASE again!! Check it out: 3/ pic.twitter.com/Vo54tRVRNG— Daniela Witten (@daniela_witten) August 9, 2020"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#an-alternative",
- "href": "schedule/slides/18-the-bootstrap.html#an-alternative",
+ "objectID": "schedule/slides/23-nnets-other.html#zero-training-error-and-model-saturation",
+ "href": "schedule/slides/23-nnets-other.html#zero-training-error-and-model-saturation",
"title": "UBC Stat406 2023W",
- "section": "An alternative",
- "text": "An alternative\n\nSo far, I didn’t use any information about the data-generating process.\nWe’ve done the non-parametric bootstrap\nThis is easiest, and most common for the methods in this module\n\n\nBut there’s another version\n\nYou could try a “parametric bootstrap”\nThis assumes knowledge about the DGP"
+ "section": "Zero training error and model saturation",
+ "text": "Zero training error and model saturation\n\nIn Deep Learning, the recommendation is to “fit until you get zero training error”\nThis somehow magically, leads to a continued decrease in test error.\nSo, who cares about the Bias-Variance Trade off!!\n\n\nLesson:\nBV Trade off is not wrong. 😢\nThis is a misunderstanding of black box algorithms and flexibility.\nWe don’t even need deep learning to illustrate."
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#same-data",
- "href": "schedule/slides/18-the-bootstrap.html#same-data",
+ "objectID": "schedule/slides/23-nnets-other.html#section-2",
+ "href": "schedule/slides/23-nnets-other.html#section-2",
"title": "UBC Stat406 2023W",
- "section": "Same data",
- "text": "Same data\n\n\nNon-parametric bootstrap\nSame as before\n\nB <- 500\nalpha <- .05\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |>\n slice_sample(prop = 1, replace = TRUE)\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # NP Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n\n 97.5% 2.5% \n3.832907 4.070232 \n\nconfint(cats.lm) # Original CI\n\n 2.5 % 97.5 %\nBwt 3.829836 4.078652\n\n\n\n\nParametric bootstrap\n\nAssume that the linear model is TRUE.\nThen, \\(\\texttt{Hwt}_i = \\widehat{\\beta}\\times \\texttt{Bwt}_i + \\widehat{e}_i\\), \\(\\widehat{e}_i \\approx \\epsilon_i\\)\nThe \\(\\epsilon_i\\) is random \\(\\longrightarrow\\) just resample \\(\\widehat{e}_i\\).\n\n\nB <- 500\nbhats <- double(B)\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nr <- residuals(cats.lm)\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |> mutate(\n Hwt = predict(cats.lm) +\n sample(r, n(), replace = TRUE)\n )\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # Parametric Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n\n 97.5% 2.5% \n3.815162 4.065045"
+ "section": "",
+ "text": "library(splines)\nset.seed(20221102)\nn <- 20\ndf <- tibble(\n x = seq(-1.5 * pi, 1.5 * pi, length.out = n),\n y = sin(x) + runif(n, -0.5, 0.5)\n)\ng <- ggplot(df, aes(x, y)) + geom_point() + stat_function(fun = sin) + ylim(c(-2, 2))\ng + stat_smooth(method = lm, formula = y ~ bs(x, df = 4), se = FALSE, color = green) + # too smooth\n stat_smooth(method = lm, formula = y ~ bs(x, df = 8), se = FALSE, color = orange) # looks good"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#bootstrap-error-sources",
- "href": "schedule/slides/18-the-bootstrap.html#bootstrap-error-sources",
+ "objectID": "schedule/slides/23-nnets-other.html#section-3",
+ "href": "schedule/slides/23-nnets-other.html#section-3",
"title": "UBC Stat406 2023W",
- "section": "Bootstrap error sources",
- "text": "Bootstrap error sources\n\n\nSimulation error\n\nusing only \\(B\\) samples to estimate \\(F\\) with \\(\\hat{F}\\).\n\nStatistical error\n\nour data depended on a sample from the population. We don’t have the whole population so we make an error by using a sample (Note: this part is what always happens with data, and what the science of statistics analyzes.)\n\nSpecification error\n\nIf we use the parametric bootstrap, and our model is wrong, then we are overconfident."
+ "section": "",
+ "text": "xn <- seq(-1.5 * pi, 1.5 * pi, length.out = 1000)\n# Spline by hand\nX <- bs(df$x, df = 20, intercept = TRUE)\nXn <- bs(xn, df = 20, intercept = TRUE)\nS <- svd(X)\nyhat <- Xn %*% S$v %*% diag(1/S$d) %*% crossprod(S$u, df$y)\ng + geom_line(data = tibble(x=xn, y=yhat), colour = orange) +\n ggtitle(\"20 degrees of freedom\")"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#types-of-intervals",
- "href": "schedule/slides/18-the-bootstrap.html#types-of-intervals",
+ "objectID": "schedule/slides/23-nnets-other.html#section-4",
+ "href": "schedule/slides/23-nnets-other.html#section-4",
"title": "UBC Stat406 2023W",
- "section": "Types of intervals",
- "text": "Types of intervals\nLet \\(\\hat{\\theta}\\) be our sample statistic, \\(\\hat{\\Theta}\\) be the resamples\nOur interval is\n\\[\n[2\\hat{\\theta} - \\theta^*_{1-\\alpha/2},\\ 2\\hat{\\theta} - \\theta^*_{\\alpha/2}]\n\\]\nwhere \\(\\theta^*_q\\) is the \\(q\\) quantile of \\(\\hat{\\Theta}\\).\n\n\nCalled the “Pivotal Interval”\nHas the correct \\(1-\\alpha\\)% coverage under very mild conditions on \\(\\hat{\\theta}\\)"
+ "section": "",
+ "text": "xn <- seq(-1.5 * pi, 1.5 * pi, length.out = 1000)\n# Spline by hand\nX <- bs(df$x, df = 40, intercept = TRUE)\nXn <- bs(xn, df = 40, intercept = TRUE)\nS <- svd(X)\nyhat <- Xn %*% S$v %*% diag(1/S$d) %*% crossprod(S$u, df$y)\ng + geom_line(data = tibble(x = xn, y = yhat), colour = orange) +\n ggtitle(\"40 degrees of freedom\")"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#types-of-intervals-1",
- "href": "schedule/slides/18-the-bootstrap.html#types-of-intervals-1",
+ "objectID": "schedule/slides/23-nnets-other.html#what-happened",
+ "href": "schedule/slides/23-nnets-other.html#what-happened",
"title": "UBC Stat406 2023W",
- "section": "Types of intervals",
- "text": "Types of intervals\nLet \\(\\hat{\\theta}\\) be our sample statistic, \\(\\hat{\\Theta}\\) be the resamples\n\\[\n[\\hat{\\theta} - t_{\\alpha/2}s,\\ \\hat{\\theta} - t_{\\alpha/2}s]\n\\]\nwhere \\(\\hat{s} = \\sqrt{\\Var{\\hat{\\Theta}}}\\)\n\n\nCalled the “Normal Interval”\nOnly works if the distribution of \\(\\hat{\\Theta}\\) is approximately Normal.\nUnlikely to work well\nDon’t do this"
+ "section": "What happened?!",
+ "text": "What happened?!\n\ndoffs <- 4:50\nmse <- function(x, y) mean((x - y)^2)\nget_errs <- function(doff) {\n X <- bs(df$x, df = doff, intercept = TRUE)\n Xn <- bs(xn, df = doff, intercept = TRUE)\n S <- svd(X)\n yh <- S$u %*% crossprod(S$u, df$y)\n bhat <- S$v %*% diag(1 / S$d) %*% crossprod(S$u, df$y)\n yhat <- Xn %*% S$v %*% diag(1 / S$d) %*% crossprod(S$u, df$y)\n nb <- sqrt(sum(bhat^2))\n tibble(train = mse(df$y, yh), test = mse(yhat, sin(xn)), norm = nb)\n}\nerrs <- map(doffs, get_errs) |>\n list_rbind() |> \n mutate(`degrees of freedom` = doffs) |> \n pivot_longer(train:test, values_to = \"error\")"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#types-of-intervals-2",
- "href": "schedule/slides/18-the-bootstrap.html#types-of-intervals-2",
+ "objectID": "schedule/slides/23-nnets-other.html#what-happened-1",
+ "href": "schedule/slides/23-nnets-other.html#what-happened-1",
"title": "UBC Stat406 2023W",
- "section": "Types of intervals",
- "text": "Types of intervals\nLet \\(\\hat{\\theta}\\) be our sample statistic, \\(\\hat{\\Theta}\\) be the resamples\n\\[\n[\\theta^*_{\\alpha/2},\\ \\theta^*_{1-\\alpha/2}]\n\\]\nwhere \\(\\theta^*_q\\) is the \\(q\\) quantile of \\(\\hat{\\Theta}\\).\n\n\nCalled the “Percentile Interval”\nWorks if \\(\\exists\\) monotone \\(m\\) so that \\(m(\\hat\\Theta) \\sim N(m(\\theta), c^2)\\)\nBetter than the Normal Interval\nMore assumptions than the Pivotal Interval"
+ "section": "What happened?!",
+ "text": "What happened?!\n\n\nCode\nggplot(errs, aes(`degrees of freedom`, error, color = name)) +\n geom_line(linewidth = 2) + \n coord_cartesian(ylim = c(0, .12)) +\n scale_x_log10() + \n scale_colour_manual(values = c(blue, orange), name = \"\") +\n geom_vline(xintercept = 20)"
},
{
- "objectID": "schedule/slides/18-the-bootstrap.html#more-details",
- "href": "schedule/slides/18-the-bootstrap.html#more-details",
+ "objectID": "schedule/slides/23-nnets-other.html#what-happened-2",
+ "href": "schedule/slides/23-nnets-other.html#what-happened-2",
"title": "UBC Stat406 2023W",
- "section": "More details",
- "text": "More details\n\nSee “All of Statistics” by Larry Wasserman, Chapter 8.3\n\nThere’s a handout with the proofs on Canvas (under Modules)"
+ "section": "What happened?!",
+ "text": "What happened?!\n\n\nCode\nbest_test <- errs |> filter(name == \"test\")\nmin_norm <- best_test$norm[which.min(best_test$error)]\nggplot(best_test, aes(norm, error)) +\n geom_line(colour = blue, size = 2) + ylab(\"test error\") +\n geom_vline(xintercept = min_norm, colour = orange) +\n scale_y_log10() + scale_x_log10() + geom_vline(xintercept = 20)"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#meta-lecture",
- "href": "schedule/slides/16-logistic-regression.html#meta-lecture",
- "title": "UBC Stat406 2023W",
- "section": "16 Logistic regression",
- "text": "16 Logistic regression\nStat 406\nDaniel J. McDonald\nLast modified – 25 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "objectID": "schedule/slides/23-nnets-other.html#degrees-of-freedom-and-complexity",
+ "href": "schedule/slides/23-nnets-other.html#degrees-of-freedom-and-complexity",
+ "title": "UBC Stat406 2023W",
+ "section": "Degrees of freedom and complexity",
+ "text": "Degrees of freedom and complexity\n\nIn low dimensions (where \\(n \\gg p\\)), with linear smoothers, df and model complexity are roughly the same.\nBut this relationship breaks down in more complicated settings\nWe’ve already seen this:\n\n\nlibrary(glmnet)\nout <- cv.glmnet(X, df$y, nfolds = n) # leave one out\n\n\n\nCode\nwith(\n out, \n tibble(lambda = lambda, df = nzero, cv = cvm, cvup = cvup, cvlo = cvlo )\n) |> \n filter(df > 0) |>\n pivot_longer(lambda:df) |> \n ggplot(aes(x = value)) +\n geom_errorbar(aes(ymax = cvup, ymin = cvlo)) +\n geom_point(aes(y = cv), colour = orange) +\n facet_wrap(~ name, strip.position = \"bottom\", scales = \"free_x\") +\n scale_y_log10() +\n scale_x_log10() + theme(axis.title.x = element_blank())"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#last-time",
- "href": "schedule/slides/16-logistic-regression.html#last-time",
+ "objectID": "schedule/slides/23-nnets-other.html#infinite-solutions",
+ "href": "schedule/slides/23-nnets-other.html#infinite-solutions",
"title": "UBC Stat406 2023W",
- "section": "Last time",
- "text": "Last time\n\nWe showed that with two classes, the Bayes’ classifier is\n\n\\[g_*(X) = \\begin{cases}\n1 & \\textrm{ if } \\frac{p_1(X)}{p_0(X)} > \\frac{1-\\pi}{\\pi} \\\\\n0 & \\textrm{ otherwise}\n\\end{cases}\\]\nwhere \\(p_1(X) = Pr(X \\given Y=1)\\) and \\(p_0(X) = Pr(X \\given Y=0)\\)\n\nWe then looked at what happens if we assume \\(Pr(X \\given Y=y)\\) is Normally distributed.\n\nWe then used this distribution and the class prior \\(\\pi\\) to find the posterior \\(Pr(Y=1 \\given X=x)\\)."
+ "section": "Infinite solutions",
+ "text": "Infinite solutions\n\nIn Lasso, df is not really the right measure of complexity\nBetter is \\(\\lambda\\) or the norm of the coefficients (these are basically the same)\nSo what happened with the Splines?\n\n\n\nWhen df \\(= 20\\), there’s a unique solution that interpolates the data\nWhen df \\(> 20\\), there are infinitely many solutions that interpolate the data.\n\nBecause we used the SVD to solve the system, we happened to pick one: the one that has the smallest \\(\\Vert\\hat\\beta\\Vert_2\\)\nRecent work in Deep Learning shows that SGD has the same property: it returns the local optima with the smallest norm.\nIf we measure complexity in terms of the norm of the weights, rather than by counting parameters, we don’t see double descent anymore."
},
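A small sketch of the minimum-norm point (simulated, not from the slides): with p > n, the SVD-based solution interpolates the data and has the smallest 2-norm among all interpolating solutions.

```r
set.seed(406)
n <- 20; p <- 40
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

S <- svd(X)                                   # X = U D V'
bhat <- S$v %*% (crossprod(S$u, y) / S$d)     # V D^{-1} U' y
max(abs(X %*% bhat - y))                      # essentially zero: interpolates the data

z <- rnorm(p)
z <- z - S$v %*% crossprod(S$v, z)            # a direction in the null space of X
b_other <- bhat + z                           # also interpolates the data ...
c(max(abs(X %*% b_other - y)),                # ... check
  norm_min = sqrt(sum(bhat^2)),               # but the SVD solution ...
  norm_other = sqrt(sum(b_other^2)))          # ... has the smaller norm
```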
{
- "objectID": "schedule/slides/16-logistic-regression.html#direct-model",
- "href": "schedule/slides/16-logistic-regression.html#direct-model",
+ "objectID": "schedule/slides/23-nnets-other.html#the-lesson",
+ "href": "schedule/slides/23-nnets-other.html#the-lesson",
"title": "UBC Stat406 2023W",
- "section": "Direct model",
- "text": "Direct model\nInstead, let’s directly model the posterior\n\\[\n\\begin{aligned}\nPr(Y = 1 \\given X=x) & = \\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}} \\\\\nPr(Y = 0 | X=x) & = \\frac{1}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}=1-\\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}\n\\end{aligned}\n\\]\nThis is logistic regression."
+ "section": "The lesson",
+ "text": "The lesson\n\nDeep learning isn’t magic.\nZero training error with lots of parameters doesn’t mean good test error.\nWe still need the bias variance tradeoff\nIt’s intuition still applies: more flexibility eventually leads to increased MSE\nBut we need to be careful how we measure complexity.\n\n\n\nThere is very interesting recent theory that says when we can expect lower test error to the right of the interpolation threshold than to the left."
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#why-this",
- "href": "schedule/slides/16-logistic-regression.html#why-this",
+ "objectID": "schedule/slides/21-nnets-intro.html#meta-lecture",
+ "href": "schedule/slides/21-nnets-intro.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Why this?",
- "text": "Why this?\n\\[Pr(Y = 1 \\given X=x) = \\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}\\]\n\nThere are lots of ways to map \\(\\R \\mapsto [0,1]\\).\nThe “logistic” function \\(z\\mapsto (1 + \\exp(-z))^{-1} = \\exp(z) / (1+\\exp(z)) =:h(z)\\) is nice.\nIt’s symmetric: \\(1 - h(z) = h(-z)\\)\nHas a nice derivative: \\(h'(z) = \\frac{\\exp(z)}{(1 + \\exp(z))^2} = h(z)(1-h(z))\\).\nIt’s the inverse of the “log-odds” (logit): \\(\\log(p / (1-p))\\)."
+ "section": "21 Neural nets",
+ "text": "21 Neural nets\nStat 406\nDaniel J. McDonald\nLast modified – 12 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#another-linear-classifier",
- "href": "schedule/slides/16-logistic-regression.html#another-linear-classifier",
+ "objectID": "schedule/slides/21-nnets-intro.html#overview",
+ "href": "schedule/slides/21-nnets-intro.html#overview",
"title": "UBC Stat406 2023W",
- "section": "Another linear classifier",
- "text": "Another linear classifier\nLike LDA, logistic regression is a linear classifier\nThe logit (i.e.: log odds) transformation gives a linear decision boundary \\[\\log\\left( \\frac{\\P(Y = 1 \\given X=x)}{\\P(Y = 0 \\given X=x) } \\right) = \\beta_0 + \\beta^{\\top} x\\] The decision boundary is the hyperplane \\(\\{x : \\beta_0 + \\beta^{\\top} x = 0\\}\\)\nIf the log-odds are below 0, classify as 0, above 0 classify as a 1."
+ "section": "Overview",
+ "text": "Overview\nNeural networks are models for supervised learning\nLinear combinations of features are passed through a non-linear transformation in successive layers\nAt the top layer, the resulting latent factors are fed into an algorithm for predictions\n(Most commonly via least squares or logistic loss)"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#logistic-regression-is-also-easy-in-r",
- "href": "schedule/slides/16-logistic-regression.html#logistic-regression-is-also-easy-in-r",
+ "objectID": "schedule/slides/21-nnets-intro.html#background",
+ "href": "schedule/slides/21-nnets-intro.html#background",
"title": "UBC Stat406 2023W",
- "section": "Logistic regression is also easy in R",
- "text": "Logistic regression is also easy in R\n\nlogistic <- glm(y ~ ., dat, family = \"binomial\")\n\nOr we can use lasso or ridge regression or a GAM as before\n\nlasso_logit <- cv.glmnet(x, y, family = \"binomial\")\nridge_logit <- cv.glmnet(x, y, alpha = 0, family = \"binomial\")\ngam_logit <- gam(y ~ s(x), data = dat, family = \"binomial\")\n\n\n\nglm means generalized linear model"
+ "section": "Background",
+ "text": "Background\n\n\nNeural networks have come about in 3 “waves”\nThe first was an attempt in the 1950s to model the mechanics of the human brain\nIt appeared the brain worked by\n\ntaking atomic units known as neurons, which can be “on” or “off”\nputting them in networks\n\nA neuron itself interprets the status of other neurons\nThere weren’t really computers, so we couldn’t estimate these things"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#baby-example-continued-from-last-time",
- "href": "schedule/slides/16-logistic-regression.html#baby-example-continued-from-last-time",
+ "objectID": "schedule/slides/21-nnets-intro.html#background-1",
+ "href": "schedule/slides/21-nnets-intro.html#background-1",
"title": "UBC Stat406 2023W",
- "section": "Baby example (continued from last time)",
- "text": "Baby example (continued from last time)\n\ndat1 <- generate_lda_2d(100, Sigma = .5 * diag(2))\nlogit <- glm(y ~ ., dat1 |> mutate(y = y - 1), family = \"binomial\")\nsummary(logit)\n\n\nCall:\nglm(formula = y ~ ., family = \"binomial\", data = mutate(dat1, \n y = y - 1))\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -2.6649 0.6281 -4.243 2.21e-05 ***\nx1 2.5305 0.5995 4.221 2.43e-05 ***\nx2 1.6610 0.4365 3.805 0.000142 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 138.469 on 99 degrees of freedom\nResidual deviance: 68.681 on 97 degrees of freedom\nAIC: 74.681\n\nNumber of Fisher Scoring iterations: 6"
+ "section": "Background",
+ "text": "Background\nAfter the development of parallel, distributed computation in the 1980s, this “artificial intelligence” view was diminished\nAnd neural networks gained popularity\nBut, the growing popularity of SVMs and boosting/bagging in the late 1990s, neural networks again fell out of favor\nThis was due to many of the problems we’ll discuss (non-convexity being the main one)\n\nIn the mid 2000’s, new approaches for initializing neural networks became available\nThese approaches are collectively known as deep learning\nState-of-the-art performance on various classification tasks has been accomplished via neural networks\nToday, Neural Networks/Deep Learning are the hottest…"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#visualizing-the-classification-boundary",
- "href": "schedule/slides/16-logistic-regression.html#visualizing-the-classification-boundary",
+ "objectID": "schedule/slides/21-nnets-intro.html#high-level-overview",
+ "href": "schedule/slides/21-nnets-intro.html#high-level-overview",
"title": "UBC Stat406 2023W",
- "section": "Visualizing the classification boundary",
- "text": "Visualizing the classification boundary\n\n\nCode\ngr <- expand_grid(x1 = seq(-2.5, 3, length.out = 100), \n x2 = seq(-2.5, 3, length.out = 100))\npts <- predict(logit, gr)\ng0 <- ggplot(dat1, aes(x1, x2)) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = tibble(gr, disc = pts), aes(x1, x2, fill = disc)) +\n geom_point(aes(shape = as.factor(y)), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_steps2(n.breaks = 6, name = \"log odds\") \ng0"
+ "section": "High level overview",
+ "text": "High level overview"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#calculation",
- "href": "schedule/slides/16-logistic-regression.html#calculation",
+ "objectID": "schedule/slides/21-nnets-intro.html#recall-nonparametric-regression",
+ "href": "schedule/slides/21-nnets-intro.html#recall-nonparametric-regression",
"title": "UBC Stat406 2023W",
- "section": "Calculation",
- "text": "Calculation\nWhile the R formula for logistic regression is straightforward, it’s not as easy to compute as OLS or LDA or QDA.\nLogistic regression for two classes simplifies to a likelihood:\nWrite \\(p_i(\\beta) = \\P(Y_i = 1 | X = x_i,\\beta)\\)\n\n\\(P(Y_i = y_i \\given X = x_i, \\beta) = p_i^{y_i}(1-p_i)^{1-y_i}\\) (…Bernoulli distribution)\n\\(P(\\mathbf{Y} \\given \\mathbf{X}, \\beta) = \\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\\)."
+ "section": "Recall nonparametric regression",
+ "text": "Recall nonparametric regression\nSuppose \\(Y \\in \\mathbb{R}\\) and we are trying estimate the regression function \\[\\Expect{Y\\given X} = f_*(X)\\]\nIn Module 2, we discussed basis expansion,\n\nWe know \\(f_*(x) =\\sum_{k=1}^\\infty \\beta_k h_k(x)\\) some basis \\(h_1,h_2,\\ldots\\) (using \\(h\\) instead of \\(\\phi\\) to match ISLR)\nTruncate this expansion at \\(K\\): \\(f_*^K(x) \\approx \\sum_{k=1}^K \\beta_k h_k(x)\\)\nEstimate \\(\\beta_k\\) with least squares"
},
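A minimal sketch of that recipe (my own simulated example, using a B-spline basis from the splines package for the h_k).

```r
library(splines)
set.seed(406)
n <- 100
x <- sort(runif(n, -2, 2))
y <- sin(3 * x) + rnorm(n, sd = 0.2)

K <- 8
H <- bs(x, df = K)                 # n x K basis matrix, columns h_k(x_i)
fit <- lm(y ~ H)                   # least-squares estimates of beta_0, ..., beta_K
fhat <- fitted(fit)                # the truncated expansion evaluated at the data
plot(x, y, col = "grey40"); lines(x, fhat, col = "orange", lwd = 2)
```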
{
- "objectID": "schedule/slides/16-logistic-regression.html#calculation-1",
- "href": "schedule/slides/16-logistic-regression.html#calculation-1",
+ "objectID": "schedule/slides/21-nnets-intro.html#recall-nonparametric-regression-1",
+ "href": "schedule/slides/21-nnets-intro.html#recall-nonparametric-regression-1",
"title": "UBC Stat406 2023W",
- "section": "Calculation",
- "text": "Calculation\nWrite \\(p_i(\\beta) = \\P(Y_i = 1 | X = x_i,\\beta)\\)\n\\[\n\\begin{aligned}\n\\ell(\\beta)\n& = \\log \\left( \\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i} \\right)\\\\\n&=\\sum_{i=1}^n \\left( y_i\\log(p_i(\\beta)) + (1-y_i)\\log(1-p_i(\\beta))\\right) \\\\\n& =\n\\sum_{i=1}^n \\left( y_i\\log(e^{\\beta^{\\top}x_i}/(1+e^{\\beta^{\\top}x_i})) - (1-y_i)\\log(1+e^{\\beta^{\\top}x_i})\\right) \\\\\n& =\n\\sum_{i=1}^n \\left( y_i\\beta^{\\top}x_i -\\log(1 + e^{\\beta^{\\top} x_i})\\right)\n\\end{aligned}\n\\]\nThis gets optimized via Newton-Raphson updates and iteratively reweighed least squares."
+ "section": "Recall nonparametric regression",
+ "text": "Recall nonparametric regression\nThe weaknesses of this approach are:\n\nThe basis is fixed and independent of the data\nIf \\(p\\) is large, then nonparametrics doesn’t work well at all (recall the Curse of Dimensionality)\nIf the basis doesn’t “agree” with \\(f_*\\), then \\(K\\) will have to be large to capture the structure\nWhat if parts of \\(f_*\\) have substantially different structure? Say \\(f_*(x)\\) really wiggly for \\(x \\in [-1,3]\\) but smooth elsewhere\n\nAn alternative would be to have the data tell us what kind of basis to use (Module 5)"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#irwls-for-logistic-regression-skip-for-now",
- "href": "schedule/slides/16-logistic-regression.html#irwls-for-logistic-regression-skip-for-now",
+ "objectID": "schedule/slides/21-nnets-intro.html#layer-for-regression",
+ "href": "schedule/slides/21-nnets-intro.html#layer-for-regression",
"title": "UBC Stat406 2023W",
- "section": "IRWLS for logistic regression (skip for now)",
- "text": "IRWLS for logistic regression (skip for now)\n(This is preparation for Neural Networks.)\n\nlogit_irwls <- function(y, x, maxit = 100, tol = 1e-6) {\n p <- ncol(x)\n beta <- double(p) # initialize coefficients\n beta0 <- 0\n conv <- FALSE # hasn't converged\n iter <- 1 # first iteration\n while (!conv && (iter < maxit)) { # check loops\n iter <- iter + 1 # update first thing (so as not to forget)\n eta <- beta0 + x %*% beta\n mu <- 1 / (1 + exp(-eta))\n gp <- 1 / (mu * (1 - mu)) # inverse of derivative of logistic\n z <- eta + (y - mu) * gp # effective transformed response\n beta_new <- coef(lm(z ~ x, weights = 1 / gp)) # do Weighted Least Squares\n conv <- mean(abs(c(beta0, beta) - betaNew)) < tol # check if the betas are \"moving\"\n beta0 <- betaNew[1] # update betas\n beta <- betaNew[-1]\n }\n return(c(beta0, beta))\n}"
+ "section": "1-layer for Regression",
+ "text": "1-layer for Regression\n\n\nA single layer neural network model is \\[\n\\begin{aligned}\n&f(x) = \\beta_0 + \\sum_{k=1}^K \\beta_k h_k(x) \\\\\n&= \\beta_0 + \\sum_{k=1}^K \\beta_k \\ g(w_{k0} + w_k^{\\top}x)\\\\\n&= \\beta_0 + \\sum_{k=1}^K \\beta_k \\ A_k\\\\\n\\end{aligned}\n\\]\nCompare: A nonparametric regression \\[f(x) = \\beta_0 + \\sum_{k=1}^K \\beta_k {\\phi_k(x)}\\]"
},
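A hand-rolled forward pass for this single-layer model, with made-up dimensions and parameters, only to show the shapes involved (taking g to be a ReLU is my assumption).

```r
g <- function(u) pmax(u, 0)            # one choice of activation function (ReLU)

set.seed(406)
p <- 3; K <- 4
x     <- rnorm(p)                      # a single input vector
W     <- matrix(rnorm(K * p), K, p)    # rows of W are the w_k
w0    <- rnorm(K)                      # biases w_{k0}
beta  <- rnorm(K)
beta0 <- rnorm(1)

A  <- g(w0 + W %*% x)                  # hidden units A_k = g(w_{k0} + w_k' x)
fx <- beta0 + sum(beta * A)            # f(x) = beta_0 + sum_k beta_k A_k
fx
```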
{
- "objectID": "schedule/slides/16-logistic-regression.html#comparing-lda-and-logistic-regression",
- "href": "schedule/slides/16-logistic-regression.html#comparing-lda-and-logistic-regression",
+ "objectID": "schedule/slides/21-nnets-intro.html#terminology",
+ "href": "schedule/slides/21-nnets-intro.html#terminology",
"title": "UBC Stat406 2023W",
- "section": "Comparing LDA and Logistic regression",
- "text": "Comparing LDA and Logistic regression\nBoth decision boundaries are linear in \\(x\\):\n\nLDA \\(\\longrightarrow \\alpha_0 + \\alpha_1^\\top x\\)\nLogit \\(\\longrightarrow \\beta_0 + \\beta_1^\\top x\\).\n\nBut the parameters are estimated differently."
+ "section": "Terminology",
+ "text": "Terminology\n\\[f(x) = {\\beta_0} + \\sum_{k=1}^{{K}} {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}\\] The main components are\n\nThe derived features \\({A_k = g(w_{k0} + w_k^{\\top}x)}\\) and are called the hidden units or activations\nThe function \\(g\\) is called the activation function (more on this later)\nThe parameters \\({\\beta_0},{\\beta_k},{w_{k0}},{w_k}\\) are estimated from the data for all \\(k = 1,\\ldots, K\\).\nThe number of hidden units \\({K}\\) is a tuning parameter\n\\(\\beta_0\\) and \\(w_{k0}\\) are usually called biases (I’m going to set them to 0 and ignore them in future formulas. Just for space. It’s just an intercept)"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#comparing-lda-and-logistic-regression-1",
- "href": "schedule/slides/16-logistic-regression.html#comparing-lda-and-logistic-regression-1",
+ "objectID": "schedule/slides/21-nnets-intro.html#terminology-1",
+ "href": "schedule/slides/21-nnets-intro.html#terminology-1",
"title": "UBC Stat406 2023W",
- "section": "Comparing LDA and Logistic regression",
- "text": "Comparing LDA and Logistic regression\nExamine the joint distribution of \\((X,\\ Y)\\) (not the posterior):\n\nLDA: \\(f(X_i,\\ Y_i) = \\underbrace{ f(X_i \\given Y_i)}_{\\textrm{Gaussian}}\\underbrace{ f(Y_i)}_{\\textrm{Bernoulli}}\\)\nLogistic Regression: \\(f(X_i,Y_i) = \\underbrace{ f(Y_i\\given X_i)}_{\\textrm{Logistic}}\\underbrace{ f(X_i)}_{\\textrm{Ignored}}\\)\nLDA estimates the joint, but Logistic estimates only the conditional (posterior) distribution. But this is really all we need.\nSo logistic requires fewer assumptions.\nBut if the two classes are perfectly separable, logistic crashes (and the MLE is undefined, too many solutions)\nLDA “works” even if the conditional isn’t normal, but works very poorly if any X is qualitative"
+ "section": "Terminology",
+ "text": "Terminology\n\\[f(x) = {\\beta_0} + \\sum_{k=1}^{{K}} {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}\\]\nNotes (no biases):\n\\(\\beta \\in \\R^k\\).\n\\(w_k \\in \\R^p,\\ k = 1,\\ldots,K\\)\n\\(\\mathbf{W} \\in \\R^{K\\times p}\\)"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#comparing-with-qda-2-classes",
- "href": "schedule/slides/16-logistic-regression.html#comparing-with-qda-2-classes",
+ "objectID": "schedule/slides/21-nnets-intro.html#what-about-classification-10-classes-2-layers",
+ "href": "schedule/slides/21-nnets-intro.html#what-about-classification-10-classes-2-layers",
"title": "UBC Stat406 2023W",
- "section": "Comparing with QDA (2 classes)",
- "text": "Comparing with QDA (2 classes)\n\nRecall: this gives a “quadratic” decision boundary (it’s a curve).\nIf we have \\(p\\) columns in \\(X\\)\n\nLogistic estimates \\(p+1\\) parameters\nLDA estimates \\(2p + p(p+1)/2 + 1\\)\nQDA estimates \\(2p + p(p+1) + 1\\)\n\nIf \\(p=50\\),\n\nLogistic: 51\nLDA: 1376\nQDA: 2651\n\nQDA doesn’t get used much: there are better nonlinear versions with way fewer parameters"
+ "section": "What about classification (10 classes, 2 layers)",
+ "text": "What about classification (10 classes, 2 layers)\n\n\n\\[\n\\begin{aligned}\nA_k^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_\\ell^{(2)} &= g\\left(\\sum_{k=1}^{K_1} w^{(2)}_{\\ell,k} A_k^{(1)} \\right)\\\\\nz_m &= \\sum_{\\ell=1}^{K_2} \\beta_{m,\\ell} A_\\ell^{(2)}\\\\\nf_m(x) &= \\frac{1}{1 + \\exp(-z_m)}\\\\\n\\end{aligned}\n\\]\n\n\n\n\n\n\n\n\n\nPredict class with largest probability: \\(\\hat{Y} = \\argmax_{m} f_m(x)\\)"
},
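The same kind of forward pass for the two-layer, 10-class model above, again with arbitrary made-up weights, just to trace the equations through.

```r
g <- function(u) pmax(u, 0)            # ReLU activation (an assumption)
set.seed(406)
p <- 5; K1 <- 8; K2 <- 6; M <- 10

x  <- rnorm(p)
W1 <- matrix(rnorm(K1 * p), K1, p)     # W_1: K1 x p
W2 <- matrix(rnorm(K2 * K1), K2, K1)   # W_2: K2 x K1
B  <- matrix(rnorm(M * K2), M, K2)     # B:   M x K2

A1 <- g(W1 %*% x)                      # first-layer activations
A2 <- g(W2 %*% A1)                     # second-layer activations
z  <- B %*% A2                         # one score z_m per class
f  <- 1 / (1 + exp(-z))                # f_m(x)
which.max(f)                           # predicted class, argmax_m f_m(x)
```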
{
- "objectID": "schedule/slides/16-logistic-regression.html#bad-parameter-counting",
- "href": "schedule/slides/16-logistic-regression.html#bad-parameter-counting",
+ "objectID": "schedule/slides/21-nnets-intro.html#what-about-classification-10-classes-2-layers-1",
+ "href": "schedule/slides/21-nnets-intro.html#what-about-classification-10-classes-2-layers-1",
"title": "UBC Stat406 2023W",
- "section": "Bad parameter counting",
- "text": "Bad parameter counting\nI’ve motivated LDA as needing \\(\\Sigma\\), \\(\\pi\\) and \\(\\mu_0\\), \\(\\mu_1\\)\nIn fact, we don’t need all of this to get the decision boundary.\nSo the “degrees of freedom” is much lower if we only want the classes and not the probabilities.\nThe decision boundary only really depends on\n\n\\(\\Sigma^{-1}(\\mu_1-\\mu_0)\\)\n\\((\\mu_1+\\mu_0)\\),\nso appropriate algorithms estimate \\(<2p\\) parameters."
+ "section": "What about classification (10 classes, 2 layers)",
+ "text": "What about classification (10 classes, 2 layers)\n\n\nNotes:\n\\(B \\in \\R^{M\\times K_2}\\) (here \\(M=10\\)).\n\\(\\mathbf{W}_2 \\in \\R^{K_2\\times K_1}\\)\n\\(\\mathbf{W}_1 \\in \\R^{K_1\\times p}\\)"
},
{
- "objectID": "schedule/slides/16-logistic-regression.html#note-again",
- "href": "schedule/slides/16-logistic-regression.html#note-again",
+ "objectID": "schedule/slides/21-nnets-intro.html#two-observations",
+ "href": "schedule/slides/21-nnets-intro.html#two-observations",
"title": "UBC Stat406 2023W",
- "section": "Note again:",
- "text": "Note again:\nwhile logistic regression and LDA produce linear decision boundaries, they are not linear smoothers\nAIC/BIC/Cp work if you use the likelihood correctly and count degrees-of-freedom correctly\nMust people use either test set or CV"
+ "section": "Two observations",
+ "text": "Two observations\n\nThe \\(g\\) function generates a feature map\n\nWe start with \\(p\\) covariates and we generate \\(K\\) features (1-layer)\n\n\nLogistic / Least-squares with a polynomial transformation\n\\[\n\\begin{aligned}\n&\\Phi(x) \\\\\n& =\n(1, x_1, \\ldots, x_p, x_1^2,\\ldots,x_p^2,\\ldots\\\\\n& \\quad \\ldots x_1x_2, \\ldots, x_{p-1}x_p) \\\\\n& =\n(\\phi_1(x),\\ldots,\\phi_{K_2}(x))\\\\\nf(x) &= \\sum_{k=1}^{K_2} \\beta_k \\phi_k(x) = \\beta^\\top \\Phi(x)\n\\end{aligned}\n\\]\n\n\nNeural network\n\\[\\begin{aligned}\nA_k &= g\\left( \\sum_{j=1}^p w_{kj}x_j\\right) = g\\left( w_{k}^{\\top}x\\right)\\\\\n\\Phi(x) &= (A_1,\\ldots, A_K)^\\top \\in \\mathbb{R}^{K}\\\\\nf(x) &=\\beta^{\\top} \\Phi(x)=\\beta^\\top A\\\\\n&= \\sum_{k=1}^K \\beta_k g\\left( \\sum_{j=1}^p w_{kj}x_j\\right)\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#meta-lecture",
- "href": "schedule/slides/14-classification-intro.html#meta-lecture",
+ "objectID": "schedule/slides/21-nnets-intro.html#two-observations-1",
+ "href": "schedule/slides/21-nnets-intro.html#two-observations-1",
"title": "UBC Stat406 2023W",
- "section": "14 Classification",
- "text": "14 Classification\nStat 406\nDaniel J. McDonald\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Two observations",
+ "text": "Two observations\n\nIf \\(g(u) = u\\), (or \\(=3u\\)) then neural networks reduce to (massively underdetermined) ordinary least squares (try to show this)\n\n\nReLU is the current fashion (used to be tanh or logistic)"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#an-overview-of-classification",
- "href": "schedule/slides/14-classification-intro.html#an-overview-of-classification",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#meta-lecture",
+ "href": "schedule/slides/19-bagging-and-rf.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "An Overview of Classification",
- "text": "An Overview of Classification\n\nA person arrives at an emergency room with a set of symptoms that could be 1 of 3 possible conditions. Which one is it?\nAn online banking service must be able to determine whether each transaction is fraudulent or not, using a customer’s location, past transaction history, etc.\nGiven a set of individuals sequenced DNA, can we determine whether various mutations are associated with different phenotypes?\n\n\nThese problems are not regression problems. They are classification problems."
+ "section": "19 Bagging and random forests",
+ "text": "19 Bagging and random forests\nStat 406\nDaniel J. McDonald\nLast modified – 11 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#the-set-up",
- "href": "schedule/slides/14-classification-intro.html#the-set-up",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#bagging",
+ "href": "schedule/slides/19-bagging-and-rf.html#bagging",
"title": "UBC Stat406 2023W",
- "section": "The Set-up",
- "text": "The Set-up\nIt begins just like regression: suppose we have observations \\[\\{(x_1,y_1),\\ldots,(x_n,y_n)\\}\\]\nAgain, we want to estimate a function that maps \\(X\\) to \\(Y\\) to predict as yet observed data.\n(This function is known as a classifier)\nThe same constraints apply:\n\nWe want a classifier that predicts test data, not just the training data.\nOften, this comes with the introduction of some bias to get lower variance and better predictions."
+ "section": "Bagging",
+ "text": "Bagging\nMany methods (trees, nonparametric smoothers) tend to have low bias but high variance.\nEspecially fully grown trees (that’s why we prune them)\n\nHigh-variance\n\nif we split the training data into two parts at random and fit a decision tree to each part, the results will be quite different.\n\nIn contrast, a low variance estimator\n\nwould yield similar results if applied to the two parts (consider \\(\\widehat{f} = 0\\)).\n\n\nBagging, short for bootstrap aggregation, is a general purpose procedure for reducing variance.\nWe’ll use it specifically in the context of trees, but it can be applied much more broadly."
},
{
- "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality",
- "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#bagging-the-heuristic-motivation",
+ "href": "schedule/slides/19-bagging-and-rf.html#bagging-the-heuristic-motivation",
"title": "UBC Stat406 2023W",
- "section": "How do we measure quality?",
- "text": "How do we measure quality?\nBefore in regression, we have \\(y_i \\in \\mathbb{R}\\) and use squared error loss to measure accuracy: \\((y - \\hat{y})^2\\).\nInstead, let \\(y \\in \\mathcal{K} = \\{1,\\ldots, K\\}\\)\n(This is arbitrary, sometimes other numbers, such as \\(\\{-1,1\\}\\) will be used)\nWe can always take “factors”: \\(\\{\\textrm{cat},\\textrm{dog}\\}\\) and convert to integers, which is what we assume.\nWe again make predictions \\(\\hat{y}=k\\) based on the data\n\nWe get zero loss if we predict the right class\nWe lose \\(\\ell(k,k')\\) on \\((k\\neq k')\\) for incorrect predictions"
+ "section": "Bagging: The heuristic motivation",
+ "text": "Bagging: The heuristic motivation\nSuppose we have \\(n\\) uncorrelated observations \\(Z_1, \\ldots, Z_n\\), each with variance \\(\\sigma^2\\).\nWhat is the variance of\n\\[\\overline{Z} = \\frac{1}{n} \\sum_{i=1}^n Z_i\\ \\ \\ ?\\]\n\nSuppose we had \\(B\\) separate (uncorrelated) training sets, \\(1, \\ldots, B\\),\nWe can form \\(B\\) separate model fits, \\(\\widehat{f}^1(x), \\ldots, \\widehat{f}^B(x)\\), and then average them:\n\\[\\widehat{f}_{B}(x) = \\frac{1}{B} \\sum_{b=1}^B \\widehat{f}^b(x)\\]"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-1",
- "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-1",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#bagging-the-bootstrap-part",
+ "href": "schedule/slides/19-bagging-and-rf.html#bagging-the-bootstrap-part",
"title": "UBC Stat406 2023W",
- "section": "How do we measure quality?",
- "text": "How do we measure quality?\nSuppose you have a fever of 39º C. You get a rapid test on campus.\n\n\n\nLoss\nTest +\nTest -\n\n\n\n\nAre +\n0\nInfect others\n\n\nAre -\nIsolation\n0"
+ "section": "Bagging: The bootstrap part",
+ "text": "Bagging: The bootstrap part\n\nThis isn’t practical\n\nwe don’t have many training sets.\n\n\nWe therefore turn to the bootstrap to simulate having many training sets.\nSuppose we have data \\(Z_1, \\ldots, Z_n\\)\n\nChoose some large number of samples, \\(B\\).\nFor each \\(b = 1,\\ldots,B\\), resample from \\(Z_1, \\ldots, Z_n\\), call it \\(\\widetilde{Z}_1, \\ldots, \\widetilde{Z}_n\\).\nCompute \\(\\widehat{f}^b = \\widehat{f}(\\widetilde{Z}_1, \\ldots, \\widetilde{Z}_n)\\).\n\n\\[\\widehat{f}_{\\textrm{bag}}(x) = \\frac{1}{B} \\sum_{b=1}^B \\widehat{f}^b(x)\\]\nThis process is known as Bagging"
},
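To make the resample, fit, average recipe concrete, here is a minimal R sketch of bagging a regression tree by hand. This is an illustration, not the course code: the toy data, the rpart() base learner, and the predict_bag() helper are all assumptions made for the example.

library(rpart)  # base learner for the sketch; any low-bias, high-variance fit would do

set.seed(406)
n <- 100
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)  # toy nonlinear signal

B <- 200
fits <- lapply(seq_len(B), function(b) {
  idx <- sample.int(n, n, replace = TRUE)                            # bootstrap resample
  rpart(y ~ x, data = dat[idx, ], control = rpart.control(cp = 0))   # deep (essentially unpruned) tree
})

# f_bag(x): average the B tree predictions
predict_bag <- function(newdata) rowMeans(sapply(fits, predict, newdata = newdata))
head(predict_bag(dat))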
{
- "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-2",
- "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-2",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#bagging-trees",
+ "href": "schedule/slides/19-bagging-and-rf.html#bagging-trees",
"title": "UBC Stat406 2023W",
- "section": "How do we measure quality?",
- "text": "How do we measure quality?\nSuppose you have a fever of 39º C. You get a rapid test on campus.\n\n\n\nLoss\nTest +\nTest -\n\n\n\n\nAre +\n0\n1\n\n\nAre -\n1\n0"
+ "section": "Bagging trees",
+ "text": "Bagging trees\n\n\n\n\n\nThe procedure for trees is the following\n\nChoose a large number \\(B\\).\nFor each \\(b = 1,\\ldots, B\\), grow an unpruned tree on the \\(b^{th}\\) bootstrap draw from the data.\nAverage all these trees together."
},
{
- "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-3",
- "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-3",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#bagging-trees-1",
+ "href": "schedule/slides/19-bagging-and-rf.html#bagging-trees-1",
"title": "UBC Stat406 2023W",
- "section": "How do we measure quality?",
- "text": "How do we measure quality?\n\nWe’re going to use \\(g(x)\\) to be our classifier. It takes values in \\(\\mathcal{K}\\)."
+ "section": "Bagging trees",
+ "text": "Bagging trees\n\n\n\n\n\nEach tree, since it is unpruned, will have\n\nlow / high variance\nlow / high bias\n\nTherefore averaging many trees results in an estimator that has\n\nlower / higher variance and\nlow / high bias."
},
{
- "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-4",
- "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-4",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#bagging-trees-variable-importance-measures",
+ "href": "schedule/slides/19-bagging-and-rf.html#bagging-trees-variable-importance-measures",
"title": "UBC Stat406 2023W",
- "section": "How do we measure quality?",
- "text": "How do we measure quality?\nAgain, we appeal to risk \\[R_n(g) = E [\\ell(Y,g(X))]\\] If we use the law of total probability, this can be written \\[R_n(g) = E_X \\sum_{y=1}^K \\ell(y,\\; g(X)) Pr(Y = y \\given X)\\] We minimize this over a class of options \\(\\mathcal{G}\\), to produce \\[g_*(X) = \\argmin_{g\\in\\mathcal{G}} E_X \\sum_{y=1}^K \\ell(y,g(X)) Pr(Y = y \\given X)\\]"
+ "section": "Bagging trees: Variable importance measures",
+    "text": "Bagging trees: Variable importance measures\nBagging can dramatically improve the predictive performance of trees,\nbut we sacrifice some interpretability.\nWe no longer have that nice diagram that shows the segmentation of the predictor space\n(more accurately, we have \\(B\\) of them).\nTo recover some information, we can do the following:\n\nFor each of the \\(B\\) trees and each of the \\(p\\) variables, we record the amount by which the Gini index is reduced when splitting on that variable\nReport the average reduction over all \\(B\\) trees."
},
{
- "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-5",
- "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-5",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#random-forest",
+ "href": "schedule/slides/19-bagging-and-rf.html#random-forest",
"title": "UBC Stat406 2023W",
- "section": "How do we measure quality?",
- "text": "How do we measure quality?\n\\(g_*\\) is named the Bayes’ classifier for loss \\(\\ell\\) in class \\(\\mathcal{G}\\).\n\\(R_n(g_*)\\) is the called the Bayes’ limit or Bayes’ Risk.\nIt’s the best we could hope to do in terms of \\(\\ell\\) if we knew the distribution of the data.\n\nBut we don’t, so we’ll try to do our best to estimate \\(g_*\\)."
+ "section": "Random Forest",
+ "text": "Random Forest\nRandom Forest is an extension of Bagging, in which the bootstrap trees are decorrelated.\nRemember: \\(\\Var{\\overline{Z}} = \\frac{1}{n}\\Var{Z_1}\\) unless the \\(Z_i\\)’s are correlated\nSo Bagging may not reduce the variance that much because the training sets are correlated across trees.\n\nHow do we decorrelate?\nDraw a bootstrap sample and start to build a tree.\n\nBut\n\nBefore we split, we randomly pick\n\n\n\\(m\\) of the possible \\(p\\) predictors as candidates for the split."
},
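The effect of that correlation can be quantified. The slide does not state the formula, but for B fits with common variance sigma^2 and pairwise correlation rho, the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / B, a standard identity included here as an aside:

# Variance of the average of B equicorrelated fits (sigma2 = per-fit variance, rho = pairwise correlation)
var_avg <- function(rho, B = 50, sigma2 = 1) rho * sigma2 + (1 - rho) * sigma2 / B
c(uncorrelated = var_avg(0), correlated = var_avg(0.6), identical = var_avg(1))
# With rho = 0 we get sigma2 / B; with rho = 1 averaging buys nothing.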
{
- "objectID": "schedule/slides/14-classification-intro.html#best-classifier-overall",
- "href": "schedule/slides/14-classification-intro.html#best-classifier-overall",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#decorrelating",
+ "href": "schedule/slides/19-bagging-and-rf.html#decorrelating",
"title": "UBC Stat406 2023W",
- "section": "Best classifier overall",
- "text": "Best classifier overall\n(for now, we limit to 2 classes)\nOnce we make a specific choice for \\(\\ell\\), we can find \\(g_*\\) exactly (pretending we know the distribution)\nBecause \\(Y\\) takes only a few values, zero-one loss is natural (but not the only option) \\[\\ell(y,\\ g(x)) = \\begin{cases}0 & y=g(x)\\\\1 & y\\neq g(x) \\end{cases} \\Longrightarrow R_n(g) = \\Expect{\\ell(Y,\\ g(X))} = Pr(g(X) \\neq Y),\\]"
+ "section": "Decorrelating",
+ "text": "Decorrelating\nA new sample of size \\(m\\) of the predictors is taken at each split.\nUsually, we use about \\(m = \\sqrt{p}\\)\nIn other words, at each split, we aren’t even allowed to consider the majority of possible predictors!"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#best-classifier-overall-1",
- "href": "schedule/slides/14-classification-intro.html#best-classifier-overall-1",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#what-is-going-on-here",
+ "href": "schedule/slides/19-bagging-and-rf.html#what-is-going-on-here",
"title": "UBC Stat406 2023W",
- "section": "Best classifier overall",
- "text": "Best classifier overall\n\n\n\nLoss\nTest +\nTest -\n\n\n\n\nAre +\n0\n1\n\n\nAre -\n1\n0"
+ "section": "What is going on here?",
+ "text": "What is going on here?\nSuppose there is 1 really strong predictor and many mediocre ones.\n\nThen each tree will have this one predictor in it,\nTherefore, each tree will look very similar (i.e. highly correlated).\nAveraging highly correlated things leads to much less variance reduction than if they were uncorrelated.\n\nIf we don’t allow some trees/splits to use this important variable, each of the trees will be much less similar and hence much less correlated.\nBagging Trees is Random Forest when \\(m = p\\), that is, when we can consider all the variables at each split."
},
{
- "objectID": "schedule/slides/14-classification-intro.html#best-classifier-overall-2",
- "href": "schedule/slides/14-classification-intro.html#best-classifier-overall-2",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#example-with-mobility-data",
+ "href": "schedule/slides/19-bagging-and-rf.html#example-with-mobility-data",
"title": "UBC Stat406 2023W",
- "section": "Best classifier overall",
- "text": "Best classifier overall\nThis means we want to classify a new observation \\((x_0,y_0)\\) such that \\(g(x_0) = y_0\\) as often as possible\nUnder this loss, we have \\[\n\\begin{aligned}\ng_*(X) &= \\argmin_{g} Pr(g(X) \\neq Y) \\\\\n&= \\argmin_{g} \\left[ 1 - Pr(Y = g(x) | X=x)\\right] \\\\\n&= \\argmax_{g} Pr(Y = g(x) | X=x )\n\\end{aligned}\n\\]"
+ "section": "Example with Mobility data",
+ "text": "Example with Mobility data\n\nlibrary(randomForest)\nlibrary(kableExtra)\nset.seed(406406)\nmob <- Stat406::mobility |>\n mutate(mobile = as.factor(Mobility > .1)) |>\n select(-ID, -Name, -Mobility, -State) |>\n drop_na()\nn <- nrow(mob)\ntrainidx <- sample.int(n, floor(n * .75))\ntestidx <- setdiff(1:n, trainidx)\ntrain <- mob[trainidx, ]\ntest <- mob[testidx, ]\nrf <- randomForest(mobile ~ ., data = train)\nbag <- randomForest(mobile ~ ., data = train, mtry = ncol(mob) - 1)\npreds <- tibble(truth = test$mobile, rf = predict(rf, test), bag = predict(bag, test))\n\nkbl(cbind(table(preds$truth, preds$rf), table(preds$truth, preds$bag))) |>\n add_header_above(c(\"Truth\" = 1, \"RF\" = 2, \"Bagging\" = 2))\n\n\n\n\n\n\n\n\n\n\n\n\nTruth\n\n\nRF\n\n\nBagging\n\n\n\n\nFALSE\nTRUE\nFALSE\nTRUE\n\n\n\n\nFALSE\n61\n10\n60\n11\n\n\nTRUE\n12\n22\n10\n24"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#estimating-g_",
- "href": "schedule/slides/14-classification-intro.html#estimating-g_",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#example-with-mobility-data-1",
+ "href": "schedule/slides/19-bagging-and-rf.html#example-with-mobility-data-1",
"title": "UBC Stat406 2023W",
- "section": "Estimating \\(g_*\\)",
- "text": "Estimating \\(g_*\\)\nClassifier approach 1 (empirical risk minimization):\n\nChoose some class of classifiers \\(\\mathcal{G}\\).\nFind \\(\\argmin_{g\\in\\mathcal{G}} \\sum_{i = 1}^n I(g(x_i) \\neq y_i)\\)"
+ "section": "Example with Mobility data",
+ "text": "Example with Mobility data\n\nvarImpPlot(rf, pch = 16, col = orange)"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#bayes-classifier-and-class-densities-2-classes",
- "href": "schedule/slides/14-classification-intro.html#bayes-classifier-and-class-densities-2-classes",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#one-last-thing",
+ "href": "schedule/slides/19-bagging-and-rf.html#one-last-thing",
"title": "UBC Stat406 2023W",
- "section": "Bayes’ Classifier and class densities (2 classes)",
- "text": "Bayes’ Classifier and class densities (2 classes)\nUsing Bayes’ theorem, and recalling that \\(f_*(X) = E[Y \\given X]\\)\n\\[\\begin{aligned}\nf_*(X) & = E[Y \\given X] = Pr(Y = 1 \\given X) \\\\\n&= \\frac{Pr(X\\given Y=1) Pr(Y=1)}{Pr(X)}\\\\\n& =\\frac{Pr(X\\given Y = 1) Pr(Y = 1)}{\\sum_{k \\in \\{0,1\\}} Pr(X\\given Y = k) Pr(Y = k)} \\\\ & = \\frac{p_1(X) \\pi}{ p_1(X)\\pi + p_0(X)(1-\\pi)}\\end{aligned}\\]\n\nWe call \\(p_k(X)\\) the class (conditional) densities\n\\(\\pi\\) is the marginal probability \\(P(Y=1)\\)"
+ "section": "One last thing…",
+ "text": "One last thing…\n\nOn average\n\ndrawing \\(n\\) samples from \\(n\\) observations with replacement (bootstrapping) results in ~ 2/3 of the observations being selected. (Can you show this?)\n\n\nThe remaining ~ 1/3 of the observations are not used on that tree.\nThese are referred to as out-of-bag (OOB).\nWe can think of it as a for-free cross-validation.\nEach time a tree is grown, we get its prediction error on the unused observations.\nWe average this over all bootstrap samples."
},
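The "~ 2/3" claim can be checked directly: the chance that a given observation appears at least once in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which tends to 1 - e^{-1}, about 0.632. A quick check (my own aside, not from the slides):

n <- 1000
1 - (1 - 1 / n)^n   # exact probability an observation is selected at least once
1 - exp(-1)         # limiting value, about 0.632

# Monte Carlo version: average fraction of distinct observations per bootstrap sample
set.seed(406)
mean(replicate(500, length(unique(sample.int(n, n, replace = TRUE))) / n))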
{
- "objectID": "schedule/slides/14-classification-intro.html#bayes-classifier-and-class-densities-2-classes-1",
- "href": "schedule/slides/14-classification-intro.html#bayes-classifier-and-class-densities-2-classes-1",
+ "objectID": "schedule/slides/19-bagging-and-rf.html#out-of-bag-error-estimation-for-bagging-rf",
+ "href": "schedule/slides/19-bagging-and-rf.html#out-of-bag-error-estimation-for-bagging-rf",
"title": "UBC Stat406 2023W",
- "section": "Bayes’ Classifier and class densities (2 classes)",
- "text": "Bayes’ Classifier and class densities (2 classes)\nThe Bayes’ Classifier (best classifier for 0-1 loss) can be rewritten\n\\[g_*(X) = \\begin{cases}\n1 & \\textrm{ if } \\frac{p_1(X)}{p_0(X)} > \\frac{1-\\pi}{\\pi} \\\\\n0 & \\textrm{ otherwise}\n\\end{cases}\\]\nApproach 2: estimate everything in the expression above.\n\nWe need to estimate \\(p_1\\), \\(p_2\\), \\(\\pi\\), \\(1-\\pi\\)\nEasily extended to more than two classes"
+ "section": "Out-of-bag error estimation for bagging / RF",
+ "text": "Out-of-bag error estimation for bagging / RF\nFor randomForest(), predict() without passing newdata = gives the OOB prediction\nnot like lm() where it gives the fitted values\n\ntab <- table(predict(bag), train$mobile) \nkbl(tab) |> add_header_above(c(\"Truth\" = 1, \"Bagging\" = 2))\n\n\n\n\n\n\n\n\n\n\nTruth\n\n\nBagging\n\n\n\n\nFALSE\nTRUE\n\n\n\n\nFALSE\n182\n28\n\n\nTRUE\n21\n82\n\n\n\n\n\n\n1 - sum(diag(tab)) / sum(tab) ## OOB misclassification error, no need for CV\n\n[1] 0.1565495"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#an-alternative-easy-classifier",
- "href": "schedule/slides/14-classification-intro.html#an-alternative-easy-classifier",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#meta-lecture",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "An alternative easy classifier",
- "text": "An alternative easy classifier\nZero-One loss was natural, but try something else\nLet’s try using squared error loss instead: \\(\\ell(y,\\ f(x)) = (y - f(x))^2\\)\nThen, the Bayes’ Classifier (the function that minimizes the Bayes Risk) is \\[g_*(x) = f_*(x) = E[ Y \\given X = x] = Pr(Y = 1 \\given X)\\] (recall that \\(f_* \\in [0,1]\\) is still the regression function)\nIn this case, our “class” will actually just be a probability. But this isn’t a class, so it’s a bit unsatisfying.\nHow do we get a class prediction?\n\nDiscretize the probability:\n\\[g(x) = \\begin{cases}0 & f_*(x) < 1/2\\\\1 & \\textrm{else}\\end{cases}\\]"
+ "section": "17 Nonlinear classifiers",
+ "text": "17 Nonlinear classifiers\nStat 406\nDaniel J. McDonald\nLast modified – 30 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#estimating-g_-1",
- "href": "schedule/slides/14-classification-intro.html#estimating-g_-1",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#last-time",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#last-time",
"title": "UBC Stat406 2023W",
- "section": "Estimating \\(g_*\\)",
- "text": "Estimating \\(g_*\\)\nApproach 3:\n\nEstimate \\(f_*\\) using any method we’ve learned so far.\nPredict 0 if \\(\\hat{f}(x)\\) is less than 1/2, else predict 1."
+ "section": "Last time",
+ "text": "Last time\nWe reviewed logistic regression\n\\[\\begin{aligned}\nPr(Y = 1 \\given X=x) & = \\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}} \\\\\nPr(Y = 0 \\given X=x) & = \\frac{1}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}=1-\\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression",
- "href": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#make-it-nonlinear",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#make-it-nonlinear",
"title": "UBC Stat406 2023W",
- "section": "Claim: Classification is easier than regression",
- "text": "Claim: Classification is easier than regression\n\nLet \\(\\hat{f}\\) be any estimate of \\(f_*\\)\nLet \\(\\widehat{g} (x) = \\begin{cases}0 & \\hat f(x) < 1/2\\\\1 & else\\end{cases}\\)\n\nProof by picture."
+ "section": "Make it nonlinear",
+ "text": "Make it nonlinear\nWe can make LDA or logistic regression have non-linear decision boundaries by mapping the features to a higher dimension (just like with regular regression)\nSay:\nPolynomials\n\\((x_1, x_2) \\mapsto \\left(1,\\ x_1,\\ x_1^2,\\ x_2,\\ x_2^2,\\ x_1 x_2\\right)\\)\n\ndat1 <- generate_lda_2d(100, Sigma = .5 * diag(2)) |> mutate(y = as.factor(y))\nlogit_poly <- glm(y ~ x1 * x2 + I(x1^2) + I(x2^2), dat1, family = \"binomial\")\nlda_poly <- lda(y ~ x1 * x2 + I(x1^2) + I(x2^2), dat1)"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-1",
- "href": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-1",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#visualizing-the-classification-boundary",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#visualizing-the-classification-boundary",
"title": "UBC Stat406 2023W",
- "section": "Claim: Classification is easier than regression",
- "text": "Claim: Classification is easier than regression\n\n\nCode\nset.seed(12345)\nx <- 1:99 / 100\ny <- rbinom(99, 1, \n .25 + .5 * (x > .3 & x < .5) + \n .6 * (x > .7))\ndmat <- as.matrix(dist(x))\nksm <- function(sigma) {\n gg <- dnorm(dmat, sd = sigma) \n sweep(gg, 1, rowSums(gg), '/') %*% y\n}\nfstar <- ksm(.04)\ngg <- tibble(x = x, fstar = fstar, y = y) %>%\n ggplot(aes(x)) +\n geom_point(aes(y = y), color = blue) +\n geom_line(aes(y = fstar), color = orange, size = 2) +\n coord_cartesian(ylim = c(0,1), xlim = c(0,1)) +\n annotate(\"label\", x = .75, y = .65, label = \"f_star\", size = 5)\ngg"
+ "section": "Visualizing the classification boundary",
+ "text": "Visualizing the classification boundary\n\n\nCode\nlibrary(cowplot)\ngr <- expand_grid(x1 = seq(-2.5, 3, length.out = 100), x2 = seq(-2.5, 3, length.out = 100))\npts_logit <- predict(logit_poly, gr)\npts_lda <- predict(lda_poly, gr)\ng0 <- ggplot(dat1, aes(x1, x2)) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = tibble(gr, disc = pts_logit), aes(x1, x2, fill = disc)) +\n geom_point(aes(shape = as.factor(y)), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_viridis_b(n.breaks = 6, alpha = .5, name = \"log odds\") +\n ggtitle(\"Polynomial logit\") +\n theme(legend.position = \"bottom\", legend.key.width = unit(1.5, \"cm\"))\ng1 <- ggplot(dat1, aes(x1, x2)) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = tibble(gr, disc = pts_lda$x), aes(x1, x2, fill = disc)) +\n geom_point(aes(shape = as.factor(y)), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_viridis_b(n.breaks = 6, alpha = .5, name = bquote(delta[1] - delta[0])) +\n ggtitle(\"Polynomial lda\") +\n theme(legend.position = \"bottom\", legend.key.width = unit(1.5, \"cm\"))\nplot_grid(g0, g1)\n\n\n\nA linear decision boundary in the higher-dimensional space corresponds to a non-linear decision boundary in low dimensions."
},
{
- "objectID": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-2",
- "href": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-2",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#trees-reforestation",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#trees-reforestation",
"title": "UBC Stat406 2023W",
- "section": "Claim: Classification is easier than regression",
- "text": "Claim: Classification is easier than regression\n\n\nCode\ngg + geom_hline(yintercept = .5, color = green)"
+ "section": "Trees (reforestation)",
+ "text": "Trees (reforestation)\n\n\nWe saw regression trees last module\nClassification trees are\n\nMore natural\nSlightly different computationally\n\nEverything else is pretty much the same"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-3",
- "href": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-3",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#axis-parallel-splits",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#axis-parallel-splits",
"title": "UBC Stat406 2023W",
- "section": "Claim: Classification is easier than regression",
- "text": "Claim: Classification is easier than regression\n\n\nCode\ntib <- tibble(x = x, fstar = fstar, y = y)\nggplot(tib) +\n geom_vline(data = filter(tib, fstar > 0.5), aes(xintercept = x), alpha = .5, color = green) +\n annotate(\"label\", x = .75, y = .65, label = \"f_star\", size = 5) + \n geom_point(aes(x = x, y = y), color = blue) +\n geom_line(aes(x = x, y = fstar), color = orange, size = 2) +\n coord_cartesian(ylim = c(0,1), xlim = c(0,1))"
+ "section": "Axis-parallel splits",
+ "text": "Axis-parallel splits\nLike with regression trees, classification trees operate by greedily splitting the predictor space\n\n\n\nnames(bakeoff)\n\n [1] \"winners\" \n [2] \"series\" \n [3] \"age\" \n [4] \"occupation\" \n [5] \"hometown\" \n [6] \"percent_star\" \n [7] \"percent_technical_wins\" \n [8] \"percent_technical_bottom3\"\n [9] \"percent_technical_top3\" \n[10] \"technical_highest\" \n[11] \"technical_lowest\" \n[12] \"technical_median\" \n[13] \"judge1\" \n[14] \"judge2\" \n[15] \"viewers_7day\" \n[16] \"viewers_28day\" \n\n\n\nsmalltree <- tree(\n winners ~ technical_median + percent_star,\n data = bakeoff\n)\n\n\n\n\n\nCode\npar(mar = c(5, 5, 0, 0) + .1)\nplot(bakeoff$technical_median, bakeoff$percent_star,\n pch = c(\"-\", \"+\")[bakeoff$winners + 1], cex = 2, bty = \"n\", las = 1,\n ylab = \"% star baker\", xlab = \"times above median in technical\",\n col = orange, cex.axis = 2, cex.lab = 2\n)\npartition.tree(smalltree,\n add = TRUE, col = blue,\n ordvars = c(\"technical_median\", \"percent_star\")\n)"
},
{
- "objectID": "schedule/slides/14-classification-intro.html#how-to-find-a-classifier",
- "href": "schedule/slides/14-classification-intro.html#how-to-find-a-classifier",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#when-do-trees-do-well",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#when-do-trees-do-well",
"title": "UBC Stat406 2023W",
- "section": "How to find a classifier",
- "text": "How to find a classifier\nWhy did we go through that math?\nEach of these approaches suggests a way to find a classifier\n\nEmpirical risk minimization: Choose a set of classifiers \\(\\mathcal{G}\\) and find \\(g \\in \\mathcal{G}\\) that minimizes some estimate of \\(R_n(g)\\)\n\n\n(This can be quite challenging as, unlike in regression, the training error is nonconvex)\n\n\nDensity estimation: Estimate \\(\\pi\\) and \\(p_k\\)\nRegression: Find an estimate \\(\\hat{f}\\) of \\(f^*\\) and compare the predicted value to 1/2"
+ "section": "When do trees do well?",
+ "text": "When do trees do well?\n\n\n\n\n\n2D example\nTop Row:\ntrue decision boundary is linear\n🍎 linear classifier\n👎 tree with axis-parallel splits\nBottom Row:\ntrue decision boundary is non-linear\n🤮 A linear classifier can’t capture the true decision boundary\n🍎 decision tree is successful."
},
{
- "objectID": "schedule/slides/14-classification-intro.html#section",
- "href": "schedule/slides/14-classification-intro.html#section",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#how-do-we-build-a-tree",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#how-do-we-build-a-tree",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "Easiest classifier when \\(y\\in \\{0,\\ 1\\}\\):\n(stupidest version of the third case…)\n\nghat <- round(predict(lm(y ~ ., data = trainingdata)))\n\nThink about why this may not be very good. (At least 2 reasons I can think of.)"
+ "section": "How do we build a tree?",
+ "text": "How do we build a tree?\n\nDivide the predictor space into \\(J\\) non-overlapping regions \\(R_1, \\ldots, R_J\\)\n\n\nthis is done via greedy, recursive binary splitting\n\n\nEvery observation that falls into a given region \\(R_j\\) is given the same prediction\n\n\ndetermined by majority (or plurality) vote in that region.\n\nImportant:\n\nTrees can only make rectangular regions that are aligned with the coordinate axis.\nThe fit is greedy, which means that after a split is made, all further decisions are conditional on that split."
},
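To show what one step of greedy, recursive binary splitting looks like, here is a small sketch (a toy of mine, not course code) that scans a single numeric feature for the axis-parallel split minimizing the weighted Gini impurity:

gini <- function(y) { p <- table(y) / length(y); sum(p * (1 - p)) }

best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2   # candidate cutpoints: midpoints between observed values
  scores <- sapply(cuts, function(s) {
    left <- y[x <= s]; right <- y[x > s]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)  # weighted Gini after the split
  })
  c(cut = cuts[which.min(scores)], impurity = min(scores))
}

best_split(iris$Petal.Length, iris$Species)  # recovers the classic first split near 2.45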
{
- "objectID": "schedule/slides/12-why-smooth.html#meta-lecture",
- "href": "schedule/slides/12-why-smooth.html#meta-lecture",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#how-do-we-measure-quality-of-fit",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#how-do-we-measure-quality-of-fit",
"title": "UBC Stat406 2023W",
- "section": "12 To(o) smooth or not to(o) smooth?",
- "text": "12 To(o) smooth or not to(o) smooth?\nStat 406\nDaniel J. McDonald\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "How do we measure quality of fit?",
+ "text": "How do we measure quality of fit?\nLet \\(p_{mk}\\) be the proportion of training observations in the \\(m^{th}\\) region that are from the \\(k^{th}\\) class.\n\n\n\n\n\n\n\nclassification error rate:\n\\(E = 1 - \\max_k (\\widehat{p}_{mk})\\)\n\n\nGini index:\n\\(G = \\sum_k \\widehat{p}_{mk}(1-\\widehat{p}_{mk})\\)\n\n\ncross-entropy:\n\\(D = -\\sum_k \\widehat{p}_{mk}\\log(\\widehat{p}_{mk})\\)\n\n\n\nBoth Gini and cross-entropy measure the purity of the classifier (small if all \\(p_{mk}\\) are near zero or 1).\nThese are preferred over the classification error rate.\nClassification error is hard to optimize.\nWe build a classifier by growing a tree that minimizes \\(G\\) or \\(D\\)."
},
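As a small numeric illustration of the three measures (my own toy, not course code), evaluated on one region's class proportions:

misclass <- function(p) 1 - max(p)                          # classification error rate
gini     <- function(p) sum(p * (1 - p))                    # Gini index
entropy  <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # cross-entropy

p <- c(0.7, 0.2, 0.1)   # proportions p_mk for one region, three classes
c(error = misclass(p), gini = gini(p), cross_entropy = entropy(p))
# A purer region, e.g. p = c(.98, .01, .01), drives all three toward zero.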
{
- "objectID": "schedule/slides/12-why-smooth.html#last-time",
- "href": "schedule/slides/12-why-smooth.html#last-time",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#pruning-the-tree",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#pruning-the-tree",
"title": "UBC Stat406 2023W",
- "section": "Last time…",
- "text": "Last time…\nWe’ve been discussing smoothing methods in 1-dimension:\n\\[\\Expect{Y\\given X=x} = f(x),\\quad x\\in\\R\\]\nWe looked at basis expansions, e.g.:\n\\[f(x) \\approx \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k\\]\nWe looked at local methods, e.g.:\n\\[f(x_i) \\approx s_i^\\top \\y\\]\n\nWhat if \\(x \\in \\R^p\\) and \\(p>1\\)?\n\n\n\nNote that \\(p\\) means the dimension of \\(x\\), not the dimension of the space of the polynomial basis or something else. That’s why I put \\(k\\) above."
+ "section": "Pruning the tree",
+    "text": "Pruning the tree\n\nCross-validation can be used to directly prune the tree,\nbut it is computationally expensive (combinatorial complexity).\nInstead, we use weakest link pruning (Gini version)\n\n\\[\\sum_{m=1}^{|T|} \\sum_{k \\in R_m} \\widehat{p}_{mk}(1-\\widehat{p}_{mk}) + \\alpha |T|\\]\n\n\\(|T|\\) is the number of terminal nodes.\nEssentially, we are trading off training fit (the first term) against model complexity (the second term); compare to the lasso.\nNow, cross-validation can be used to pick \\(\\alpha\\)."
},
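A hedged sketch of weakest-link (cost-complexity) pruning using the tree package that the slides use elsewhere, run on R's built-in iris data rather than the course's bakeoff data; cv.tree() evaluates subtrees along the alpha path and prune.misclass() extracts the chosen one. The object names (big, cvres, best_size, pruned) are my own.

library(tree)
set.seed(406)
big <- tree(Species ~ ., data = iris)           # grow a classification tree
cvres <- cv.tree(big, FUN = prune.misclass)     # CV misclassification along the alpha path
best_size <- cvres$size[which.min(cvres$dev)]   # number of terminal nodes at the CV minimum
pruned <- prune.misclass(big, best = best_size) # the weakest-link pruned subtree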
{
- "objectID": "schedule/slides/12-why-smooth.html#kernels-and-interactions",
- "href": "schedule/slides/12-why-smooth.html#kernels-and-interactions",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#advantages-and-disadvantages-of-trees-again",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#advantages-and-disadvantages-of-trees-again",
"title": "UBC Stat406 2023W",
- "section": "Kernels and interactions",
- "text": "Kernels and interactions\nIn multivariate nonparametric regression, you estimate a surface over the input variables.\nThis is trying to find \\(\\widehat{f}(x_1,\\ldots,x_p)\\).\nTherefore, this function by construction includes interactions, handles categorical data, etc. etc.\nThis is in contrast with explicit linear models which need you to specify these things.\nThis extra complexity (automatically including interactions, as well as other things) comes with tradeoffs.\n\nMore complicated functions (smooth Kernel regressions vs. linear models) tend to have lower bias but higher variance."
+ "section": "Advantages and disadvantages of trees (again)",
+    "text": "Advantages and disadvantages of trees (again)\n🎉 Trees are very easy to explain (much easier than even linear regression).\n🎉 Some people believe that decision trees mirror human decision-making.\n🎉 Trees can easily be displayed graphically no matter the dimension of the data.\n🎉 Trees can easily handle qualitative predictors without the need to create dummy variables.\n💩 Trees aren’t very good at prediction.\n💩 Trees are highly variable. Small changes in training data \\(\\Longrightarrow\\) big changes in the tree.\nTo fix these last two, we can try to grow many trees and average their performance.\n\nWe do this next module"
},
{
- "objectID": "schedule/slides/12-why-smooth.html#issue-1",
- "href": "schedule/slides/12-why-smooth.html#issue-1",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#knn-classifiers",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#knn-classifiers",
"title": "UBC Stat406 2023W",
- "section": "Issue 1",
- "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\n\n\n\n\n\n\nImportant\n\n\nyou don’t need to memorize these formulas but you should know the intuition\nthe constants don’t matter for the intuition, but they matter for a particular data set. We don’t know them. So you estimate this."
+ "section": "KNN classifiers",
+ "text": "KNN classifiers\n\nWe saw \\(k\\)-nearest neighbors in the last module.\n\n\nlibrary(class)\nknn3 <- knn(dat1[, -1], gr, dat1$y, k = 3)\n\n\n\nCode\ngr$nn03 <- knn3\nggplot(dat1, aes(x1, x2)) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = tibble(gr, disc = knn3), aes(x1, x2, fill = disc), alpha = .5) +\n geom_point(aes(shape = as.factor(y)), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_manual(values = c(orange, blue), labels = c(\"0\", \"1\")) +\n theme(\n legend.position = \"bottom\", legend.title = element_blank(),\n legend.key.width = unit(2, \"cm\")\n )"
},
{
- "objectID": "schedule/slides/12-why-smooth.html#issue-1-1",
- "href": "schedule/slides/12-why-smooth.html#issue-1-1",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#choosing-k-is-very-important",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#choosing-k-is-very-important",
"title": "UBC Stat406 2023W",
- "section": "Issue 1",
- "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nRecall, this decomposition is squared bias + variance + irreducible error\n\nIt depends on the choice of \\(h\\)\n\n\\[\\textrm{MSE}(\\hat{f}) = C_1 h^4 + \\frac{C_2}{nh} + \\sigma^2\\]\n\nUsing \\(h = cn^{-1/5}\\) balances squared bias and variance, leads to the above rate. (That balance minimizes the MSE)"
+ "section": "Choosing \\(k\\) is very important",
+ "text": "Choosing \\(k\\) is very important\n\n\nCode\nset.seed(406406406)\nks <- c(1, 2, 5, 10, 20)\nnn <- map(ks, ~ as_tibble(knn(dat1[, -1], gr[, 1:2], dat1$y, .x)) |> \n set_names(sprintf(\"k = %02s\", .x))) |>\n list_cbind() |>\n bind_cols(gr)\npg <- pivot_longer(nn, starts_with(\"k =\"), names_to = \"k\", values_to = \"knn\")\n\nggplot(pg, aes(x1, x2)) +\n geom_raster(aes(fill = knn), alpha = .6) +\n facet_wrap(~ k) +\n scale_fill_manual(values = c(orange, green), labels = c(\"0\", \"1\")) +\n geom_point(data = dat1, mapping = aes(x1, x2, shape = as.factor(y)), size = 4) +\n theme_bw(base_size = 18) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n theme(\n legend.title = element_blank(),\n legend.key.height = unit(3, \"cm\")\n )\n\n\n\n\nHow should we choose \\(k\\)?\nScaling is also very important. “Nearness” is determined by distance, so better to standardize your data first.\nIf there are ties, break randomly. So even \\(k\\) is strange."
},
{
- "objectID": "schedule/slides/12-why-smooth.html#issue-1-2",
- "href": "schedule/slides/12-why-smooth.html#issue-1-2",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#knn.cv-leave-one-out",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#knn.cv-leave-one-out",
"title": "UBC Stat406 2023W",
- "section": "Issue 1",
- "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nIntuition:\nas you collect data, use a smaller bandwidth and the MSE (on future data) decreases"
+ "section": "knn.cv() (leave one out)",
+ "text": "knn.cv() (leave one out)\n\nkmax <- 20\nerr <- map_dbl(1:kmax, ~ mean(knn.cv(dat1[, -1], dat1$y, k = .x) != dat1$y))\n\n\nI would use the largest (odd) k that is close to the minimum.\nThis produces simpler, smoother, decision boundaries."
},
{
- "objectID": "schedule/slides/12-why-smooth.html#issue-1-3",
- "href": "schedule/slides/12-why-smooth.html#issue-1-3",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#alternative-using-deviance-loss-i-think-this-is-right",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#alternative-using-deviance-loss-i-think-this-is-right",
"title": "UBC Stat406 2023W",
- "section": "Issue 1",
- "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nHow does this compare to just using a linear model?\nBias\n\nThe bias of using a linear model when the truth nonlinear is a number \\(b > 0\\) which doesn’t depend on \\(n\\).\nThe bias of using kernel regression is \\(C_1/n^{4/5}\\). This goes to 0 as \\(n\\rightarrow\\infty\\).\n\nVariance\n\nThe variance of using a linear model is \\(C/n\\) no matter what\nThe variance of using kernel regression is \\(C_2/n^{4/5}\\)."
+ "section": "Alternative (using deviance loss, I think this is right)",
+ "text": "Alternative (using deviance loss, I think this is right)\n\n\nCode\ndev <- function(y, prob, prob_min = 1e-5) {\n y <- as.numeric(as.factor(y)) - 1 # 0/1 valued\n m <- mean(y)\n prob_max <- 1 - prob_min\n prob <- pmin(pmax(prob, prob_min), prob_max)\n lp <- (1 - y) * log(1 - prob) + y * log(prob)\n ly <- (1 - y) * log(1 - m) + y * log(m)\n 2 * (ly - lp)\n}\nknn.cv_probs <- function(train, cl, k = 1) {\n o <- knn.cv(train, cl, k = k, prob = TRUE)\n p <- attr(o, \"prob\")\n o <- as.numeric(as.factor(o)) - 1\n p[o == 0] <- 1 - p[o == 0]\n p\n}\ndev_err <- map_dbl(1:kmax, ~ mean(dev(dat1$y, knn.cv_probs(dat1[, -1], dat1$y, k = .x))))"
},
{
- "objectID": "schedule/slides/12-why-smooth.html#issue-1-4",
- "href": "schedule/slides/12-why-smooth.html#issue-1-4",
+ "objectID": "schedule/slides/17-nonlinear-classifiers.html#final-version",
+ "href": "schedule/slides/17-nonlinear-classifiers.html#final-version",
"title": "UBC Stat406 2023W",
- "section": "Issue 1",
- "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nTo conclude:\n\nbias of kernels goes to zero, bias of lines doesn’t (unless the truth is linear).\nbut variance of lines goes to zero faster than for kernels.\n\nIf the linear model is right, you win.\nBut if it’s wrong, you (eventually) lose as \\(n\\) grows.\nHow do you know if you have enough data?\nCompare of the kernel version with CV-selected tuning parameter with the estimate of the risk for the linear model."
+ "section": "Final version",
+ "text": "Final version\n\n\n\n\nCode\nkopt <- max(which(err == min(err)))\nkopt <- kopt + 1 * (kopt %% 2 == 0)\ngr$opt <- knn(dat1[, -1], gr[, 1:2], dat1$y, k = kopt)\ntt <- table(knn(dat1[, -1], dat1[, -1], dat1$y, k = kopt), dat1$y, dnn = c(\"predicted\", \"truth\"))\nggplot(dat1, aes(x1, x2)) +\n theme_bw(base_size = 24) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = gr, aes(x1, x2, fill = opt), alpha = .6) +\n geom_point(aes(shape = y), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_manual(values = c(orange, green), labels = c(\"0\", \"1\")) +\n theme(\n legend.position = \"bottom\", legend.title = element_blank(),\n legend.key.width = unit(2, \"cm\")\n )\n\n\n\n\n\n\n\n\n\n\n\n\nBest \\(k\\): 19\nMisclassification error: 0.17\nConfusion matrix:\n\n\n\n truth\npredicted 1 2\n 1 41 6\n 2 11 42"
},
{
- "objectID": "schedule/slides/12-why-smooth.html#issue-2",
- "href": "schedule/slides/12-why-smooth.html#issue-2",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#meta-lecture",
+ "href": "schedule/slides/15-LDA-and-QDA.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Issue 2",
- "text": "Issue 2\nFor \\(p>1\\), there is more trouble.\nFirst, lets look again at \\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nThat is for \\(p=1\\). It’s not that much slower than \\(C/n\\), the variance for linear models.\nIf \\(p>1\\) similar calculations show,\n\\[\\textrm{MSE}(\\hat f) = \\frac{C_1+C_2}{n^{4/(4+p)}} + \\sigma^2 \\hspace{2em} \\textrm{MSE}(\\hat \\beta) = b + \\frac{Cp}{n} + \\sigma^2 .\\]"
+ "section": "15 LDA and QDA",
+ "text": "15 LDA and QDA\nStat 406\nDaniel J. McDonald\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/12-why-smooth.html#issue-2-1",
- "href": "schedule/slides/12-why-smooth.html#issue-2-1",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#last-time",
+ "href": "schedule/slides/15-LDA-and-QDA.html#last-time",
"title": "UBC Stat406 2023W",
- "section": "Issue 2",
- "text": "Issue 2\n\\[\\textrm{MSE}(\\hat f) = \\frac{C_1+C_2}{n^{4/(4+p)}} + \\sigma^2 \\hspace{2em} \\textrm{MSE}(\\hat \\beta) = b + \\frac{Cp}{n} + \\sigma^2 .\\]\nWhat if \\(p\\) is big (and \\(n\\) is really big)?\n\nThen \\((C_1 + C_2) / n^{4/(4+p)}\\) is still big.\nBut \\(Cp / n\\) is small.\nSo unless \\(b\\) is big, we should use the linear model.\n\nHow do you tell? Do model selection to decide.\nA very, very questionable rule of thumb: if \\(p>\\log(n)\\), don’t do smoothing."
+ "section": "Last time",
+ "text": "Last time\nWe showed that with two classes, the Bayes’ classifier is\n\\[g_*(X) = \\begin{cases}\n1 & \\textrm{ if } \\frac{p_1(X)}{p_0(X)} > \\frac{1-\\pi}{\\pi} \\\\\n0 & \\textrm{ otherwise}\n\\end{cases}\\]\nwhere \\(p_1(X) = Pr(X \\given Y=1)\\), \\(p_0(X) = Pr(X \\given Y=0)\\) and \\(\\pi = Pr(Y=1)\\)\n\nFor more than two classes.\n\\[g_*(X) =\n\\argmax_k \\frac{\\pi_k p_k(X)}{\\sum_k \\pi_k p_k(X)}\\]\nwhere \\(p_k(X) = Pr(X \\given Y=k)\\) and \\(\\pi_k = P(Y=k)\\)"
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#meta-lecture",
- "href": "schedule/slides/10-basis-expansions.html#meta-lecture",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#estimating-these",
+ "href": "schedule/slides/15-LDA-and-QDA.html#estimating-these",
"title": "UBC Stat406 2023W",
- "section": "10 Basis expansions",
- "text": "10 Basis expansions\nStat 406\nDaniel J. McDonald\nLast modified – 27 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Estimating these",
+ "text": "Estimating these\nLet’s make some assumptions:\n\n\\(Pr(X\\given Y=k) = \\mbox{N}(\\mu_k,\\Sigma_k)\\)\n\\(\\Sigma_k = \\Sigma_{k'} = \\Sigma\\)\n\n\nThis leads to Linear Discriminant Analysis (LDA), one of the oldest classifiers"
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#what-about-nonlinear-things",
- "href": "schedule/slides/10-basis-expansions.html#what-about-nonlinear-things",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#lda",
+ "href": "schedule/slides/15-LDA-and-QDA.html#lda",
"title": "UBC Stat406 2023W",
- "section": "What about nonlinear things",
- "text": "What about nonlinear things\n\\[\\Expect{Y \\given X=x} = \\sum_{j=1}^p x_j\\beta_j\\]\nNow we relax this assumption of linearity:\n\\[\\Expect{Y \\given X=x} = f(x)\\]\nHow do we estimate \\(f\\)?\n\nFor this lecture, we use \\(x \\in \\R\\) (1 dimensional)\nHigher dimensions are possible, but complexity grows exponentially.\nWe’ll see some special techniques for \\(x\\in\\R^p\\) later this Module."
+ "section": "LDA",
+ "text": "LDA\n\nSplit your training data into \\(K\\) subsets based on \\(y_i=k\\).\nIn each subset, estimate the mean of \\(X\\): \\(\\widehat\\mu_k = \\overline{X}_k\\)\nEstimate the pooled variance: \\[\\widehat\\Sigma = \\frac{1}{n-K} \\sum_{k \\in \\mathcal{K}} \\sum_{i \\in k} (x_i - \\overline{X}_k) (x_i - \\overline{X}_k)^{\\top}\\]\nEstimate the class proportion: \\(\\widehat\\pi_k = n_k/n\\)"
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#start-simple",
- "href": "schedule/slides/10-basis-expansions.html#start-simple",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#lda-1",
+ "href": "schedule/slides/15-LDA-and-QDA.html#lda-1",
"title": "UBC Stat406 2023W",
- "section": "Start simple",
- "text": "Start simple\nFor any \\(f : \\R \\rightarrow [0,1]\\)\n\\[f(x) = f(x_0) + f'(x_0)(x-x_0) + \\frac{1}{2}f''(x_0)(x-x_0)^2 + \\frac{1}{3!}f'''(x_0)(x-x_0)^3 + R_3(x-x_0)\\]\nSo we can linearly regress \\(y_i = f(x_i)\\) on the polynomials.\nThe more terms we use, the smaller \\(R\\).\n\n\nCode\nset.seed(406406)\ndata(arcuate, package = \"Stat406\") \narcuate <- arcuate |> slice_sample(n = 220)\narcuate %>% \n ggplot(aes(position, fa)) + \n geom_point(color = blue) +\n geom_smooth(color = orange, formula = y ~ poly(x, 3), method = \"lm\", se = FALSE)"
+ "section": "LDA",
+ "text": "LDA\nAssume just \\(K = 2\\) so \\(k \\in \\{0,\\ 1\\}\\)\nWe predict \\(\\widehat{y} = 1\\) if\n\\[\\widehat{p_1}(x) / \\widehat{p_0}(x) > \\widehat{\\pi_0} / \\widehat{\\pi_1}\\]\nPlug in the density estimates:\n\\[\\widehat{p_k}(x) = N(x - \\widehat{\\mu}_k,\\ \\widehat\\Sigma)\\]"
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#same-thing-different-orders",
- "href": "schedule/slides/10-basis-expansions.html#same-thing-different-orders",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#lda-2",
+ "href": "schedule/slides/15-LDA-and-QDA.html#lda-2",
"title": "UBC Stat406 2023W",
- "section": "Same thing, different orders",
- "text": "Same thing, different orders\n\n\nCode\narcuate %>% \n ggplot(aes(position, fa)) + \n geom_point(color = blue) + \n geom_smooth(aes(color = \"a\"), formula = y ~ poly(x, 4), method = \"lm\", se = FALSE) +\n geom_smooth(aes(color = \"b\"), formula = y ~ poly(x, 7), method = \"lm\", se = FALSE) +\n geom_smooth(aes(color = \"c\"), formula = y ~ poly(x, 25), method = \"lm\", se = FALSE) +\n scale_color_manual(name = \"Taylor order\",\n values = c(green, red, orange), labels = c(\"4 terms\", \"7 terms\", \"25 terms\"))"
+ "section": "LDA",
+ "text": "LDA\nNow we take \\(\\log\\) and simplify \\((K=2)\\):\n\\[\n\\begin{aligned}\n&\\Rightarrow \\log(\\widehat{p_1}(x)\\times\\widehat{\\pi_1}) - \\log(\\widehat{p_0}(x)\\times\\widehat{\\pi_0})\n= \\cdots = \\cdots\\\\\n&= \\underbrace{\\left(x^\\top\\widehat\\Sigma^{-1}\\overline X_1-\\frac{1}{2}\\overline X_1^\\top \\widehat\\Sigma^{-1}\\overline X_1 + \\log \\widehat\\pi_1\\right)}_{\\delta_1(x)} - \\underbrace{\\left(x^\\top\\widehat\\Sigma^{-1}\\overline X_0-\\frac{1}{2}\\overline X_0^\\top \\widehat\\Sigma^{-1}\\overline X_0 + \\log \\widehat\\pi_0\\right)}_{\\delta_0(x)}\\\\\n&= \\delta_1(x) - \\delta_0(x)\n\\end{aligned}\n\\]\nIf \\(\\delta_1(x) > \\delta_0(x)\\), we set \\(\\widehat g(x)=1\\)"
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#still-a-linear-smoother",
- "href": "schedule/slides/10-basis-expansions.html#still-a-linear-smoother",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#one-dimensional-intuition",
+ "href": "schedule/slides/15-LDA-and-QDA.html#one-dimensional-intuition",
"title": "UBC Stat406 2023W",
- "section": "Still a “linear smoother”",
- "text": "Still a “linear smoother”\nReally, this is still linear regression, just in a transformed space.\nIt’s not linear in \\(x\\), but it is linear in \\((x,x^2,x^3)\\) (for the 3rd-order case)\nSo, we’re still doing OLS with\n\\[\\X=\\begin{bmatrix}1& x_1 & x_1^2 & x_1^3 \\\\ \\vdots&&&\\vdots\\\\1& x_n & x_n^2 & x_n^3\\end{bmatrix}\\]\nSo we can still use our nice formulas for LOO-CV, GCV, Cp, AIC, etc.\n\nmax_deg <- 20\ncv_nice <- function(mdl) mean( residuals(mdl)^2 / (1 - hatvalues(mdl))^2 ) \ncvscores <- map_dbl(seq_len(max_deg), ~ cv_nice(lm(fa ~ poly(position, .), data = arcuate)))"
+ "section": "One dimensional intuition",
+ "text": "One dimensional intuition\n\nset.seed(406406406)\nn <- 100\npi <- .6\nmu0 <- -1\nmu1 <- 2\nsigma <- 2\ntib <- tibble(\n y = rbinom(n, 1, pi),\n x = rnorm(n, mu0, sigma) * (y == 0) + rnorm(n, mu1, sigma) * (y == 1)\n)\n\n\n\nCode\ngg <- ggplot(tib, aes(x, y)) +\n geom_point(colour = blue) +\n stat_function(fun = ~ 6 * (1 - pi) * dnorm(.x, mu0, sigma), colour = orange) +\n stat_function(fun = ~ 6 * pi * dnorm(.x, mu1, sigma), colour = orange) +\n annotate(\"label\",\n x = c(-3, 4.5), y = c(.5, 2 / 3),\n label = c(\"(1-pi)*p[0](x)\", \"pi*p[1](x)\"), parse = TRUE\n )\ngg"
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#section",
- "href": "schedule/slides/10-basis-expansions.html#section",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#what-is-linear",
+ "href": "schedule/slides/15-LDA-and-QDA.html#what-is-linear",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "Code\nlibrary(cowplot)\ng1 <- ggplot(tibble(cvscores, degrees = seq(max_deg)), aes(degrees, cvscores)) +\n geom_point(colour = blue) +\n geom_line(colour = blue) + \n labs(ylab = 'LOO-CV', xlab = 'polynomial degree') +\n geom_vline(xintercept = which.min(cvscores), linetype = \"dotted\") \ng2 <- ggplot(arcuate, aes(position, fa)) + \n geom_point(colour = blue) + \n geom_smooth(\n colour = orange, \n formula = y ~ poly(x, which.min(cvscores)), \n method = \"lm\", \n se = FALSE\n )\nplot_grid(g1, g2, ncol = 2)"
+ "section": "What is linear?",
+ "text": "What is linear?\nLook closely at the equation for \\(\\delta_1(x)\\):\n\\[\\delta_1(x)=x^\\top\\widehat\\Sigma^{-1}\\overline X_1-\\frac{1}{2}\\overline X_1^\\top \\widehat\\Sigma^{-1}\\overline X_1 + \\log \\widehat\\pi_1\\]\nWe can write this as \\(\\delta_1(x) = x^\\top a_1 + b_1\\) with \\(a_1 = \\widehat\\Sigma^{-1}\\overline X_1\\) and \\(b_1=-\\frac{1}{2}\\overline X_1^\\top \\widehat\\Sigma^{-1}\\overline X_1 + \\log \\widehat\\pi_1\\).\nWe can do the same for \\(\\delta_0(x)\\) (in terms of \\(a_0\\) and \\(b_0\\))\nTherefore,\n\\[\\delta_1(x)-\\delta_0(x) = x^\\top(a_1-a_0) + (b_1-b_0)\\]\nThis is how we discriminate between the classes.\nWe just calculate \\((a_1 - a_0)\\) (a vector in \\(\\R^p\\)), and \\(b_1 - b_0\\) (a scalar)"
},
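To make the "one vector and one scalar" point concrete, here is a self-contained sketch that computes a_1 - a_0 and b_1 - b_0 by hand on simulated two-class Gaussian data (my own simulation with identity covariance, not the course's generate_lda_2d()).

set.seed(1)
n0 <- 60; n1 <- 40; p <- 2
X0 <- matrix(rnorm(n0 * p), n0)      # class 0, mean (0, 0)
X1 <- matrix(rnorm(n1 * p), n1) + 1  # class 1, mean (1, 1)
X <- rbind(X0, X1); y <- rep(0:1, c(n0, n1))

xbar0 <- colMeans(X0); xbar1 <- colMeans(X1)
pi1 <- n1 / (n0 + n1)
S <- (crossprod(sweep(X0, 2, xbar0)) + crossprod(sweep(X1, 2, xbar1))) / (n0 + n1 - 2)  # pooled covariance
Sinv <- solve(S)

a <- Sinv %*% (xbar1 - xbar0)                                  # a_1 - a_0, a vector in R^p
b <- -0.5 * (t(xbar1) %*% Sinv %*% xbar1 - t(xbar0) %*% Sinv %*% xbar0) +
  log(pi1 / (1 - pi1))                                         # b_1 - b_0, a scalar
ghat <- as.integer(X %*% a + c(b) > 0)                         # predict 1 iff delta_1(x) > delta_0(x)
mean(ghat == y)                                                # training accuracy of the hand-rolled LDA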
{
- "objectID": "schedule/slides/10-basis-expansions.html#other-bases",
- "href": "schedule/slides/10-basis-expansions.html#other-bases",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#baby-example",
+ "href": "schedule/slides/15-LDA-and-QDA.html#baby-example",
"title": "UBC Stat406 2023W",
- "section": "Other bases",
- "text": "Other bases\n\nPolynomials\n\n\\(x \\mapsto \\left(1,\\ x,\\ x^2, \\ldots, x^p\\right)\\) (technically, not quite this, they are orthogonalized)\n\nLinear splines\n\n\\(x \\mapsto \\bigg(1,\\ x,\\ (x-k_1)_+,\\ (x-k_2)_+,\\ldots, (x-k_p)_+\\bigg)\\) for some choices \\(\\{k_1,\\ldots,k_p\\}\\)\n\nCubic splines\n\n\\(x \\mapsto \\bigg(1,\\ x,\\ x^2,\\ x^3,\\ (x-k_1)^3_+,\\ (x-k_2)^3_+,\\ldots, (x-k_p)^3_+\\bigg)\\) for some choices \\(\\{k_1,\\ldots,k_p\\}\\)\n\nFourier series\n\n\\(x \\mapsto \\bigg(1,\\ \\cos(2\\pi x),\\ \\sin(2\\pi x),\\ \\cos(2\\pi 2 x),\\ \\sin(2\\pi 2 x), \\ldots, \\cos(2\\pi p x),\\ \\sin(2\\pi p x)\\bigg)\\)"
+ "section": "Baby example",
+ "text": "Baby example\n\n\n\nlibrary(mvtnorm)\nlibrary(MASS)\ngenerate_lda_2d <- function(\n n, p = c(.5, .5), \n mu = matrix(c(0, 0, 1, 1), 2),\n Sigma = diag(2)) {\n X <- rmvnorm(n, sigma = Sigma)\n tibble(\n y = which(rmultinom(n, 1, p) == 1, TRUE)[,1],\n x1 = X[, 1] + mu[1, y],\n x2 = X[, 2] + mu[2, y]\n )\n}\ndat1 <- generate_lda_2d(100, Sigma = .5 * diag(2))\nlda_fit <- lda(y ~ ., dat1)"
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#how-do-you-choose",
- "href": "schedule/slides/10-basis-expansions.html#how-do-you-choose",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#multiple-classes",
+ "href": "schedule/slides/15-LDA-and-QDA.html#multiple-classes",
"title": "UBC Stat406 2023W",
- "section": "How do you choose?",
- "text": "How do you choose?\nProcedure 1:\n\nPick your favorite basis. This is not as easy as it sounds. For instance, if \\(f\\) is a step function, linear splines will do well with good knots, but polynomials will be terrible unless you have lots of terms.\nPerform OLS on different orders.\nUse model selection criterion to choose the order.\n\nProcedure 2:\n\nUse a bunch of high-order bases, say Linear splines and Fourier series and whatever else you like.\nUse Lasso or Ridge regression or elastic net. (combining bases can lead to multicollinearity, but we may not care)\nUse model selection criteria to choose the tuning parameter."
+ "section": "Multiple classes",
+ "text": "Multiple classes\n\nmoreclasses <- generate_lda_2d(150, c(.2, .3, .5), matrix(c(0, 0, 1, 1, 1, 0), 2), .5 * diag(2))\nseparateclasses <- generate_lda_2d(150, c(.2, .3, .5), matrix(c(-1, -1, 2, 2, 2, -1), 2), .1 * diag(2))"
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#try-both-procedures",
- "href": "schedule/slides/10-basis-expansions.html#try-both-procedures",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#qda",
+ "href": "schedule/slides/15-LDA-and-QDA.html#qda",
"title": "UBC Stat406 2023W",
- "section": "Try both procedures",
- "text": "Try both procedures\n\nSplit arcuate into 75% training data and 25% testing data.\nEstimate polynomials up to 20 as before and choose best order.\nDo ridge, lasso and elastic net \\(\\alpha=.5\\) on 20th order polynomials, B splines with 20 knots, and Fourier series with \\(p=20\\). Choose tuning parameter (using lambda.1se).\nRepeat 1-3 10 times (different splits)"
+ "section": "QDA",
+ "text": "QDA\nJust like LDA, but \\(\\Sigma_k\\) is separate for each class.\nProduces Quadratic decision boundary.\nEverything else is the same.\n\nqda_fit <- qda(y ~ ., dat1)\nqda_3fit <- qda(y ~ ., moreclasses)"
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#section-1",
- "href": "schedule/slides/10-basis-expansions.html#section-1",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#class-comparison",
+ "href": "schedule/slides/15-LDA-and-QDA.html#class-comparison",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "library(glmnet)\nmapto01 <- function(x, pad = .005) (x - min(x) + pad) / (max(x) - min(x) + 2 * pad)\nx <- mapto01(arcuate$position)\nXmat <- cbind(\n poly(x, 20), \n splines::bs(x, df = 20), \n cos(2 * pi * outer(x, 1:20)), sin(2 * pi * outer(x, 1:20))\n)\ny <- arcuate$fa\nrmse <- function(z, s) sqrt(mean( (z - s)^2 ))\nnzero <- function(x) with(x, nzero[match(lambda.1se, lambda)])\nsim <- function(maxdeg = 20, train_frac = 0.75) {\n n <- nrow(arcuate)\n train <- as.logical(rbinom(n, 1, train_frac))\n test <- !train # not precisely 25%, but on average\n polycv <- map_dbl(seq(maxdeg), ~ cv_nice(lm(y ~ Xmat[,seq(.)], subset = train))) # figure out which order to use\n bpoly <- lm(y[train] ~ Xmat[train, seq(which.min(polycv))]) # now use it\n lasso <- cv.glmnet(Xmat[train, ], y[train])\n ridge <- cv.glmnet(Xmat[train, ], y[train], alpha = 0)\n elnet <- cv.glmnet(Xmat[train, ], y[train], alpha = .5)\n tibble(\n methods = c(\"poly\", \"lasso\", \"ridge\", \"elnet\"),\n rmses = c(\n rmse(y[test], cbind(1, Xmat[test, 1:which.min(polycv)]) %*% coef(bpoly)),\n rmse(y[test], predict(lasso, Xmat[test,])),\n rmse(y[test], predict(ridge, Xmat[test,])),\n rmse(y[test], predict(elnet, Xmat[test,]))\n ),\n nvars = c(which.min(polycv), nzero(lasso), nzero(ridge), nzero(elnet))\n )\n}\nset.seed(12345)\nsim_results <- map(seq(20), sim) |> list_rbind() # repeat it 20 times"
+ "section": "3 class comparison",
+ "text": "3 class comparison"
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#section-2",
- "href": "schedule/slides/10-basis-expansions.html#section-2",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#notes",
+ "href": "schedule/slides/15-LDA-and-QDA.html#notes",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "Code\nsim_results |> \n pivot_longer(-methods) |> \n ggplot(aes(methods, value, fill = methods)) + \n geom_boxplot() +\n facet_wrap(~ name, scales = \"free_y\") + \n ylab(\"\") +\n theme(legend.position = \"none\") + \n xlab(\"\") +\n scale_fill_viridis_d(begin = .2, end = 1)"
+ "section": "Notes",
+ "text": "Notes\n\nLDA is a linear classifier. It is not a linear smoother.\n\nIt is derived from Bayes rule.\nAssume each class-conditional density in Gaussian\nIt assumes the classes have different mean vectors, but the same (common) covariance matrix.\nIt estimates densities and probabilities and “plugs in”\n\nQDA is not a linear classifier. It depends on quadratic functions of the data.\n\nIt is derived from Bayes rule.\nAssume each class-conditional density in Gaussian\nIt assumes the classes have different mean vectors and different covariance matrices.\nIt estimates densities and probabilities and “plugs in”"
},
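To make "estimates densities and probabilities and plugs in" concrete, here is a hand-rolled LDA sketch (not the course's code): estimate the priors, the class means, and one pooled covariance, then classify by the largest plug-in value of \(\hat\pi_k\,\phi(x;\hat\mu_k,\hat\Sigma)\). The simulated data and the use of `mvtnorm::dmvnorm()` are assumptions for illustration.

```r
library(mvtnorm) # dmvnorm() for the multivariate Gaussian density

set.seed(1)
n <- 150
X <- rbind(rmvnorm(n / 2, c(0, 0)), rmvnorm(n / 2, c(2, 1)))
y <- rep(1:2, each = n / 2)

pi_hat <- table(y) / n                                   # estimated priors
mu_hat <- lapply(1:2, function(k) colMeans(X[y == k, ])) # estimated class means
# pooled covariance: the "same (common) covariance matrix" assumption of LDA
Sig_hat <- Reduce(`+`, lapply(1:2, function(k) {
  Xc <- scale(X[y == k, ], center = mu_hat[[k]], scale = FALSE)
  crossprod(Xc)
})) / (n - 2)

# "plug in": evaluate pi_k * N(x; mu_k, Sigma) for each class and take the argmax
post_unnorm <- sapply(1:2, function(k) pi_hat[k] * dmvnorm(X, mu_hat[[k]], Sig_hat))
yhat <- max.col(post_unnorm)
mean(yhat == y) # training accuracy of the plug-in classifier
```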
{
- "objectID": "schedule/slides/10-basis-expansions.html#common-elements",
- "href": "schedule/slides/10-basis-expansions.html#common-elements",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#section",
+ "href": "schedule/slides/15-LDA-and-QDA.html#section",
"title": "UBC Stat406 2023W",
- "section": "Common elements",
- "text": "Common elements\nIn all these cases, we transformed \\(x\\) to a higher-dimensional space\nUsed \\(p+1\\) dimensions with polynomials\nUsed \\(p+4\\) dimensions with cubic splines\nUsed \\(2p+1\\) dimensions with Fourier basis"
+ "section": "",
+ "text": "It is hard (maybe impossible) to come up with reasonable classifiers that are linear smoothers. Many “look” like a linear smoother, but then apply a nonlinear transformation."
},
{
- "objectID": "schedule/slides/10-basis-expansions.html#featurization",
- "href": "schedule/slides/10-basis-expansions.html#featurization",
+ "objectID": "schedule/slides/15-LDA-and-QDA.html#naïve-bayes",
+ "href": "schedule/slides/15-LDA-and-QDA.html#naïve-bayes",
"title": "UBC Stat406 2023W",
- "section": "Featurization",
- "text": "Featurization\nEach case applied a feature map to \\(x\\), call it \\(\\Phi\\)\nWe used new “features” \\(\\Phi(x) = \\bigg(\\phi_1(x),\\ \\phi_2(x),\\ldots,\\phi_k(x)\\bigg)\\)\nNeural networks (coming in module 4) use this idea\nYou’ve also probably seen it in earlier courses when you added interaction terms or other transformations.\n\nSome methods (notably Support Vector Machines and Ridge regression) allow \\(k=\\infty\\)\nSee [ISLR] 9.3.2 for baby overview or [ESL] 5.8 (note 😱)"
+ "section": "Naïve Bayes",
+ "text": "Naïve Bayes\nAssume that \\(Pr(X | Y = k) = Pr(X_1 | Y = k)\\cdots Pr(X_p | Y = k)\\).\nThat is, conditional on the class, the feature distribution is independent.\n\nIf we further assume that \\(Pr(X_j | Y = k)\\) is Gaussian,\nThis is the same as QDA but with \\(\\Sigma_k\\) Diagonal.\n\n\nDon’t have to assume Gaussian. Could do lots of stuff."
},
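A minimal by-hand Gaussian naïve Bayes sketch, to show what "\(\Sigma_k\) diagonal" buys you: one mean and one standard deviation per feature per class, so the joint class-conditional density is just a product of univariate normals. Everything below (data, names) is made up for illustration.

```r
set.seed(2)
n <- 200
y <- rep(1:2, each = n / 2)
X <- cbind(
  x1 = rnorm(n, mean = ifelse(y == 1, 0, 2), sd = ifelse(y == 1, 1, 0.5)),
  x2 = rnorm(n, mean = ifelse(y == 1, 0, 1), sd = 1)
)

pi_hat <- table(y) / n
# per-class, per-feature means and sds (i.e. a diagonal Sigma_k)
mu_hat <- lapply(1:2, function(k) colMeans(X[y == k, ]))
sd_hat <- lapply(1:2, function(k) apply(X[y == k, ], 2, sd))

nb_score <- function(x, k) {
  # prior times the product of univariate Gaussian densities
  pi_hat[k] * prod(dnorm(x, mean = mu_hat[[k]], sd = sd_hat[[k]]))
}
yhat <- apply(X, 1, function(x) which.max(c(nb_score(x, 1), nb_score(x, 2))))
mean(yhat == y) # training accuracy
```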
{
- "objectID": "schedule/slides/08-ridge-regression.html#meta-lecture",
- "href": "schedule/slides/08-ridge-regression.html#meta-lecture",
+ "objectID": "schedule/slides/13-gams-trees.html#meta-lecture",
+ "href": "schedule/slides/13-gams-trees.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "08 Ridge regression",
- "text": "08 Ridge regression\nStat 406\nDaniel J. McDonald\nLast modified – 27 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "13 GAMs and Trees",
+ "text": "13 GAMs and Trees\nStat 406\nDaniel J. McDonald\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#recap",
- "href": "schedule/slides/08-ridge-regression.html#recap",
+ "objectID": "schedule/slides/13-gams-trees.html#gams",
+ "href": "schedule/slides/13-gams-trees.html#gams",
"title": "UBC Stat406 2023W",
- "section": "Recap",
- "text": "Recap\nSo far, we have emphasized model selection as\nDecide which predictors we would like to use in our linear model\nOr similarly:\nDecide which of a few linear models to use\nTo do this, we used a risk estimate, and chose the “model” with the lowest estimate\n\nMoving forward, we need to generalize this to\nDecide which of possibly infinite prediction functions \\(f\\in\\mathcal{F}\\) to use\nThankfully, this isn’t really any different. We still use those same risk estimates.\nRemember: We were choosing models that balance bias and variance (and hence have low prediction risk).\n\\[\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "GAMs",
+ "text": "GAMs\nLast time we discussed smoothing in multiple dimensions.\nHere we introduce the concept of GAMs (Generalized Additive Models)\nThe basic idea is to imagine that the response is the sum of some functions of the predictors:\n\\[\\Expect{Y \\given X=x} = \\beta_0 + f_1(x_{1})+\\cdots+f_p(x_{p}).\\]\nNote that OLS is a GAM (take \\(f_j(x_{j})=\\beta_j x_{j}\\)):\n\\[\\Expect{Y \\given X=x} = \\beta_0 + \\beta_1 x_{1}+\\cdots+\\beta_p x_{p}.\\]"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#regularization",
- "href": "schedule/slides/08-ridge-regression.html#regularization",
+ "objectID": "schedule/slides/13-gams-trees.html#gams-1",
+ "href": "schedule/slides/13-gams-trees.html#gams-1",
"title": "UBC Stat406 2023W",
- "section": "Regularization",
- "text": "Regularization\n\nAnother way to control bias and variance is through regularization or shrinkage.\nRather than selecting a few predictors that seem reasonable, maybe trying a few combinations, use them all.\nI mean ALL.\nBut, make your estimates of \\(\\beta\\) “smaller”"
+ "section": "Gams",
+ "text": "Gams\nThese work by estimating each \\(f_i\\) using basis expansions in predictor \\(i\\)\nThe algorithm for fitting these things is called “backfitting” (very similar to the CD intuition for lasso):\n\nCenter \\(\\y\\) and \\(\\X\\).\nHold \\(f_k\\) for all \\(k\\neq j\\) fixed, and regress \\(\\X_j\\) on \\((\\y - \\widehat{\\y}_{-j})\\) using your favorite smoother.\nRepeat for \\(1\\leq j\\leq p\\).\nRepeat steps 2 and 3 until the estimated functions “stop moving” (iterate)\nReturn the results."
},
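A bare-bones backfitting loop for two predictors, using `smooth.spline()` as the "favorite smoother". This is only a sketch of the algorithm described in the slide, not what `mgcv::gam()` actually does internally; the simulated data mimic the `simple` example that appears a couple of slides later.

```r
set.seed(12345)
n <- 500
x1 <- runif(n, 0, 2 * pi)
x2 <- runif(n)
y <- 5 + 2 * sin(x1) + 8 * sqrt(x2) + rnorm(n, sd = .25)

yc <- y - mean(y)     # center the response; the intercept estimate is mean(y)
f1 <- f2 <- rep(0, n) # start both component functions at zero
for (it in 1:20) {    # iterate until the estimates "stop moving"
  f1 <- predict(smooth.spline(x1, yc - f2), x1)$y # smooth the partial residual on x1
  f1 <- f1 - mean(f1)                             # keep each f_j centered
  f2 <- predict(smooth.spline(x2, yc - f1), x2)$y # smooth the partial residual on x2
  f2 <- f2 - mean(f2)
}
yhat <- mean(y) + f1 + f2
mean((y - yhat)^2) # in-sample MSE of the hand-rolled backfit
```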
{
- "objectID": "schedule/slides/08-ridge-regression.html#brief-aside-on-optimization",
- "href": "schedule/slides/08-ridge-regression.html#brief-aside-on-optimization",
+ "objectID": "schedule/slides/13-gams-trees.html#very-small-example",
+ "href": "schedule/slides/13-gams-trees.html#very-small-example",
"title": "UBC Stat406 2023W",
- "section": "Brief aside on optimization",
- "text": "Brief aside on optimization\n\nAn optimization problem has 2 components:\n\nThe “Objective function”: e.g. \\(\\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2\\).\nThe “constraint”: e.g. “fewer than 5 non-zero entries in \\(\\beta\\)”.\n\nA constrained minimization problem is written\n\n\\[\\min_\\beta f(\\beta)\\;\\; \\mbox{ subject to }\\;\\; C(\\beta)\\]\n\n\\(f(\\beta)\\) is the objective function\n\\(C(\\beta)\\) is the constraint"
+ "section": "Very small example",
+ "text": "Very small example\n\nlibrary(mgcv)\nset.seed(12345)\nn <- 500\nsimple <- tibble(\n x1 = runif(n, 0, 2*pi),\n x2 = runif(n),\n y = 5 + 2 * sin(x1) + 8 * sqrt(x2) + rnorm(n, sd = .25)\n)\n\npivot_longer(simple, -y, names_to = \"predictor\", values_to = \"x\") |>\n ggplot(aes(x, y)) +\n geom_point(col = blue) +\n facet_wrap(~predictor, scales = \"free_x\")"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression-constrained-version",
- "href": "schedule/slides/08-ridge-regression.html#ridge-regression-constrained-version",
+ "objectID": "schedule/slides/13-gams-trees.html#very-small-example-1",
+ "href": "schedule/slides/13-gams-trees.html#very-small-example-1",
"title": "UBC Stat406 2023W",
- "section": "Ridge regression (constrained version)",
- "text": "Ridge regression (constrained version)\nOne way to do this for regression is to solve (say): \\[\n\\minimize_\\beta \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2\n\\quad \\st \\sum_j \\beta^2_j < s\n\\] for some \\(s>0\\).\n\nThis is called “ridge regression”.\nCall the minimizer of this problem \\(\\brt\\)\n\n\nCompare this to ordinary least squares:\n\\[\n\\minimize_\\beta \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2\n\\quad \\st \\beta \\in \\R^p\n\\]"
+ "section": "Very small example",
+ "text": "Very small example\nSmooth each coordinate independently\n\nex_smooth <- gam(y ~ s(x1) + s(x2), data = simple)\n# s(z) means \"smooth\" z, uses spline basis for each with ridge penalty, GCV\nplot(ex_smooth, pages = 1, scale = 0, shade = TRUE, \n resid = TRUE, se = 2, las = 1)\n\nhead(coef(ex_smooth))\n\n(Intercept) s(x1).1 s(x1).2 s(x1).3 s(x1).4 s(x1).5 \n 10.2070490 -4.5764100 0.7117161 0.4548928 0.5535001 -0.2092996 \n\nex_smooth$gcv.ubre\n\n GCV.Cp \n0.06619721"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#geometry-of-ridge-regression-contours",
- "href": "schedule/slides/08-ridge-regression.html#geometry-of-ridge-regression-contours",
+ "objectID": "schedule/slides/13-gams-trees.html#wherefore-gams",
+ "href": "schedule/slides/13-gams-trees.html#wherefore-gams",
"title": "UBC Stat406 2023W",
- "section": "Geometry of ridge regression (contours)",
- "text": "Geometry of ridge regression (contours)\n\n\nCode\nlibrary(mvtnorm)\nnorm_ball <- function(q = 1, len = 1000) {\n tg <- seq(0, 2 * pi, length = len)\n out <- tibble(x = cos(tg), b = (1 - abs(x)^q)^(1 / q), bm = -b) |>\n pivot_longer(-x, values_to = \"y\")\n out$lab <- paste0('\"||\" * beta * \"||\"', \"[\", signif(q, 2), \"]\")\n return(out)\n}\n\nellipse_data <- function(\n n = 75, xlim = c(-2, 3), ylim = c(-2, 3),\n mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {\n expand_grid(\n x = seq(xlim[1], xlim[2], length.out = n),\n y = seq(ylim[1], ylim[2], length.out = n)) |>\n rowwise() |>\n mutate(z = dmvnorm(c(x, y), mean, Sigma))\n}\n\nlballmax <- function(ed, q = 1, tol = 1e-6, niter = 20) {\n ed <- filter(ed, x > 0, y > 0)\n feasible <- (ed$x^q + ed$y^q)^(1 / q) <= 1\n best <- ed[feasible, ]\n best[which.max(best$z), ]\n}\n\n\nnb <- norm_ball(2)\ned <- ellipse_data()\nbols <- data.frame(x = 1, y = 1)\nbhat <- lballmax(ed, 2)\nggplot(nb, aes(x, y)) +\n xlim(-2, 2) +\n ylim(-2, 2) +\n geom_path(colour = red) +\n geom_contour(mapping = aes(z = z), colour = blue, data = ed, bins = 7) +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0) +\n geom_point(data = bols) +\n coord_equal() +\n geom_label(\n data = bols,\n mapping = aes(label = bquote(\"hat(beta)[ols]\")),\n parse = TRUE, \n nudge_x = .3, nudge_y = .3\n ) +\n geom_point(data = bhat) +\n xlab(bquote(beta[1])) +\n ylab(bquote(beta[2])) +\n theme_bw(base_size = 24) +\n geom_label(\n data = bhat,\n mapping = aes(label = bquote(\"hat(beta)[s]^R\")),\n parse = TRUE,\n nudge_x = -.4, nudge_y = -.4\n )"
+ "section": "Wherefore GAMs?",
+ "text": "Wherefore GAMs?\nIf\n\\(\\Expect{Y \\given X=x} = \\beta_0 + f_1(x_{1})+\\cdots+f_p(x_{p}),\\)\nthen\n\\(\\textrm{MSE}(\\hat f) = \\frac{Cp}{n^{4/5}} + \\sigma^2.\\)\n\nExponent no longer depends on \\(p\\). Converges faster. (If the truth is additive.)\nYou could also use the same methods to include “some” interactions like\n\n\\[\\begin{aligned}&\\Expect{Y \\given X=x}\\\\ &= \\beta_0 + f_{12}(x_{1},\\ x_{2})+f_3(x_3)+\\cdots+f_p(x_{p}),\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#brief-aside-on-norms",
- "href": "schedule/slides/08-ridge-regression.html#brief-aside-on-norms",
+ "objectID": "schedule/slides/13-gams-trees.html#very-small-example-2",
+ "href": "schedule/slides/13-gams-trees.html#very-small-example-2",
"title": "UBC Stat406 2023W",
- "section": "Brief aside on norms",
- "text": "Brief aside on norms\nRecall, for a vector \\(z \\in \\R^p\\)\n\\[\\snorm{z}_2 = \\sqrt{z_1^2 + z_2^2 + \\cdots + z^2_p} = \\sqrt{\\sum_{j=1}^p z_j^2}\\]\nSo,\n\\[\\snorm{z}^2_2 = z_1^2 + z_2^2 + \\cdots + z^2_p = \\sum_{j=1}^p z_j^2.\\]"
+ "section": "Very small example",
+ "text": "Very small example\nSmooth two coordinates together\n\nex_smooth2 <- gam(y ~ s(x1, x2), data = simple)\nplot(ex_smooth2,\n scheme = 2, scale = 0, shade = TRUE,\n resid = TRUE, se = 2, las = 1\n)"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#other-norms-we-should-remember",
- "href": "schedule/slides/08-ridge-regression.html#other-norms-we-should-remember",
+ "objectID": "schedule/slides/13-gams-trees.html#regression-trees",
+ "href": "schedule/slides/13-gams-trees.html#regression-trees",
"title": "UBC Stat406 2023W",
- "section": "Other norms we should remember:",
- "text": "Other norms we should remember:\n\n\\(\\ell_q\\)-norm\n\n\\(\\left(\\sum_{j=1}^p |z_j|^q\\right)^{1/q}\\)\n\n\\(\\ell_1\\)-norm (special case)\n\n\\(\\sum_{j=1}^p |z_j|\\)\n\n\\(\\ell_0\\)-norm\n\n\\(\\sum_{j=1}^p I(z_j \\neq 0 ) = \\lvert \\{j : z_j \\neq 0 \\}\\rvert\\)\n\n\\(\\ell_\\infty\\)-norm\n\n\\(\\max_{1\\leq j \\leq p} |z_j|\\)\n\n\n\n\nRecall what a norm is: https://en.wikipedia.org/wiki/Norm_(mathematics)"
+ "section": "Regression trees",
+ "text": "Regression trees\nTrees involve stratifying or segmenting the predictor space into a number of simple regions.\nTrees are simple and useful for interpretation.\nBasic trees are not great at prediction.\nModern methods that use trees are much better (Module 4)"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression",
- "href": "schedule/slides/08-ridge-regression.html#ridge-regression",
+ "objectID": "schedule/slides/13-gams-trees.html#regression-trees-1",
+ "href": "schedule/slides/13-gams-trees.html#regression-trees-1",
"title": "UBC Stat406 2023W",
- "section": "Ridge regression",
- "text": "Ridge regression\nAn equivalent way to write\n\\[\\brt = \\argmin_{ || \\beta ||_2^2 \\leq s} \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2\\]\nis in the Lagrangian form\n\\[\\brl = \\argmin_{ \\beta} \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2 + \\lambda || \\beta ||_2^2.\\]\nFor every \\(\\lambda\\) there is a unique \\(s\\) (and vice versa) that makes\n\\[\\brt = \\brl\\]"
+ "section": "Regression trees",
+ "text": "Regression trees\nRegression trees estimate piece-wise constant functions\nThe slabs are axis-parallel rectangles \\(R_1,\\ldots,R_K\\) based on \\(\\X\\)\nIn each region, we average the \\(y_i\\)’s: \\(\\hat\\mu_1,\\ldots,\\hat\\mu_k\\)\nMinimize \\(\\sum_{k=1}^K \\sum_{i=1}^n (y_i-\\mu_k)^2\\) over \\(R_k,\\mu_k\\) for \\(k\\in \\{1,\\ldots,K\\}\\)\n\nThis sounds more complicated than it is.\nThe minimization is performed greedily (like forward stepwise regression)."
},
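To see what "greedily" means here, a sketch of the very first split a regression tree would consider on a single predictor: scan candidate cut points and keep the one minimizing the within-region sum of squares. Real implementations (`tree`, `rpart`) do this over all predictors and then recurse into each region; the data below are a made-up step function.

```r
set.seed(3)
n <- 200
x <- runif(n)
y <- ifelse(x < 0.6, 1, 3) + rnorm(n, sd = 0.3) # step function plus noise

# SSE if we split at cut point cc and fit a constant mean on each side
split_sse <- function(cc) {
  left <- x < cc
  sum((y[left] - mean(y[left]))^2) + sum((y[!left] - mean(y[!left]))^2)
}
cuts <- sort(x)[-c(1, n)]      # candidate cuts at the observed x values
sse <- sapply(cuts, split_sse)
cuts[which.min(sse)]           # best first split; should land near 0.6
```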
{
- "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression-1",
- "href": "schedule/slides/08-ridge-regression.html#ridge-regression-1",
+ "objectID": "schedule/slides/13-gams-trees.html#mobility-data",
+ "href": "schedule/slides/13-gams-trees.html#mobility-data",
"title": "UBC Stat406 2023W",
- "section": "Ridge regression",
- "text": "Ridge regression\n\\(\\brt = \\argmin_{ || \\beta ||_2^2 \\leq s} \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2\\)\n\\(\\brl = \\argmin_{ \\beta} \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2 + \\lambda || \\beta ||_2^2\\)\nObserve:\n\n\\(\\lambda = 0\\) (or \\(s = \\infty\\)) makes \\(\\brl = \\bls\\)\nAny \\(\\lambda > 0\\) (or \\(s <\\infty\\)) penalizes larger values of \\(\\beta\\), effectively shrinking them.\n\n\\(\\lambda\\) and \\(s\\) are known as tuning parameters"
+ "section": "Mobility data",
+ "text": "Mobility data\n\nbigtree <- tree(Mobility ~ ., data = mob)\nsmalltree <- prune.tree(bigtree, k = .09)\ndraw.tree(smalltree, digits = 2)\n\n\nThis is called the dendrogram"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#visualizing-ridge-regression-2-coefficients",
- "href": "schedule/slides/08-ridge-regression.html#visualizing-ridge-regression-2-coefficients",
+ "objectID": "schedule/slides/13-gams-trees.html#partition-view",
+ "href": "schedule/slides/13-gams-trees.html#partition-view",
"title": "UBC Stat406 2023W",
- "section": "Visualizing ridge regression (2 coefficients)",
- "text": "Visualizing ridge regression (2 coefficients)\n\n\nCode\nb <- c(1, 1)\nn <- 1000\nlams <- c(1, 5, 10)\nols_loss <- function(b1, b2) colMeans((y - X %*% rbind(b1, b2))^2) / 2\npen <- function(b1, b2, lambda = 1) lambda * (b1^2 + b2^2) / 2\ngr <- expand_grid(\n b1 = seq(b[1] - 0.5, b[1] + 0.5, length.out = 100),\n b2 = seq(b[2] - 0.5, b[2] + 0.5, length.out = 100)\n)\n\nX <- mvtnorm::rmvnorm(n, c(0, 0), sigma = matrix(c(1, .3, .3, .5), nrow = 2))\ny <- drop(X %*% b + rnorm(n))\n\nbols <- coef(lm(y ~ X - 1))\nbridge <- coef(MASS::lm.ridge(y ~ X - 1, lambda = lams * sqrt(n)))\n\npenalties <- lams |>\n set_names(~ paste(\"lam =\", .)) |>\n map(~ pen(gr$b1, gr$b2, .x)) |>\n as_tibble()\ngr <- gr |>\n mutate(loss = ols_loss(b1, b2)) |>\n bind_cols(penalties)\n\ng1 <- ggplot(gr, aes(b1, b2)) +\n geom_raster(aes(fill = loss)) +\n scale_fill_viridis_c(direction = -1) +\n geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +\n theme(legend.position = \"bottom\") +\n guides(fill = guide_colourbar(barwidth = 20, barheight = 0.5))\n\ng2 <- gr |>\n pivot_longer(starts_with(\"lam\")) |>\n mutate(name = factor(name, levels = paste(\"lam =\", lams))) |>\n ggplot(aes(b1, b2)) +\n geom_raster(aes(fill = value)) +\n scale_fill_viridis_c(direction = -1, name = \"penalty\") +\n facet_wrap(~name, ncol = 1) +\n geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +\n theme(legend.position = \"bottom\") +\n guides(fill = guide_colourbar(barwidth = 10, barheight = 0.5))\n\ng3 <- gr |> \n mutate(across(starts_with(\"lam\"), ~ loss + .x)) |>\n pivot_longer(starts_with(\"lam\")) |>\n mutate(name = factor(name, levels = paste(\"lam =\", lams))) |>\n ggplot(aes(b1, b2)) +\n geom_raster(aes(fill = value)) +\n scale_fill_viridis_c(direction = -1, name = \"loss + pen\") +\n facet_wrap(~name, ncol = 1) +\n geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +\n theme(legend.position = \"bottom\") +\n guides(fill = guide_colourbar(barwidth = 10, barheight = 0.5))\n\ncowplot::plot_grid(g1, g2, g3, rel_widths = c(2, 1, 1), nrow = 1)"
+ "section": "Partition view",
+ "text": "Partition view\n\nmob$preds <- predict(smalltree)\npar(mfrow = c(1, 2), mar = c(5, 3, 0, 0))\ndraw.tree(smalltree, digits = 2)\ncols <- viridisLite::viridis(20, direction = -1)[cut(log(mob$Mobility), 20)]\nplot(mob$Black, mob$Commute,\n pch = 19, cex = .4, bty = \"n\", las = 1, col = cols,\n ylab = \"Commute time\", xlab = \"% Black\"\n)\npartition.tree(smalltree, add = TRUE, ordvars = c(\"Black\", \"Commute\"))\n\n\nWe predict all observations in a region with the same value.\n\\(\\bullet\\) The three regions correspond to the leaves of the tree."
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#the-effect-on-the-estimates",
- "href": "schedule/slides/08-ridge-regression.html#the-effect-on-the-estimates",
+ "objectID": "schedule/slides/13-gams-trees.html#section-1",
+ "href": "schedule/slides/13-gams-trees.html#section-1",
"title": "UBC Stat406 2023W",
- "section": "The effect on the estimates",
- "text": "The effect on the estimates\n\n\nCode\ngr |> \n mutate(z = ols_loss(b1, b2) + max(lams) * pen(b1, b2)) |>\n ggplot(aes(b1, b2)) +\n geom_raster(aes(fill = z)) +\n scale_fill_viridis_c(direction = -1) +\n geom_point(data = tibble(\n b1 = c(bols[1], bridge[,1]),\n b2 = c(bols[2], bridge[,2]),\n estimate = factor(c(\"ols\", paste0(\"ridge = \", lams)), \n levels = c(\"ols\", paste0(\"ridge = \", lams)))\n ),\n aes(shape = estimate), size = 3) +\n geom_point(data = data.frame(b1 = b[1], b2 = b[2]), colour = orange, size = 4)"
+ "section": "",
+ "text": "draw.tree(bigtree, digits = 2)\n\n\nTerminology\nWe call each split or end point a node. Each terminal node is referred to as a leaf.\nThe interior nodes lead to branches."
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#example-data",
- "href": "schedule/slides/08-ridge-regression.html#example-data",
+ "objectID": "schedule/slides/13-gams-trees.html#advantages-and-disadvantages-of-trees",
+ "href": "schedule/slides/13-gams-trees.html#advantages-and-disadvantages-of-trees",
"title": "UBC Stat406 2023W",
- "section": "Example data",
- "text": "Example data\nprostate data from [ESL]\n\ndata(prostate, package = \"ElemStatLearn\")\nprostate |> as_tibble()\n\n# A tibble: 97 × 10\n lcavol lweight age lbph svi lcp gleason pgg45 lpsa train\n <dbl> <dbl> <int> <dbl> <int> <dbl> <int> <int> <dbl> <lgl>\n 1 -0.580 2.77 50 -1.39 0 -1.39 6 0 -0.431 TRUE \n 2 -0.994 3.32 58 -1.39 0 -1.39 6 0 -0.163 TRUE \n 3 -0.511 2.69 74 -1.39 0 -1.39 7 20 -0.163 TRUE \n 4 -1.20 3.28 58 -1.39 0 -1.39 6 0 -0.163 TRUE \n 5 0.751 3.43 62 -1.39 0 -1.39 6 0 0.372 TRUE \n 6 -1.05 3.23 50 -1.39 0 -1.39 6 0 0.765 TRUE \n 7 0.737 3.47 64 0.615 0 -1.39 6 0 0.765 FALSE\n 8 0.693 3.54 58 1.54 0 -1.39 6 0 0.854 TRUE \n 9 -0.777 3.54 47 -1.39 0 -1.39 6 0 1.05 FALSE\n10 0.223 3.24 63 -1.39 0 -1.39 6 0 1.05 FALSE\n# ℹ 87 more rows\n\n\n\nUse lpsa as response."
+ "section": "Advantages and disadvantages of trees",
+ "text": "Advantages and disadvantages of trees\n🎉 Trees are very easy to explain (much easier than even linear regression).\n🎉 Some people believe that decision trees mirror human decision.\n🎉 Trees can easily be displayed graphically no matter the dimension of the data.\n🎉 Trees can easily handle qualitative predictors without the need to create dummy variables.\n💩 Trees aren’t very good at prediction.\n💩 Full trees badly overfit, so we “prune” them using CV\n\nWe’ll talk more about trees next module for Classification."
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression-path",
- "href": "schedule/slides/08-ridge-regression.html#ridge-regression-path",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#meta-lecture",
+ "href": "schedule/slides/11-kernel-smoothers.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Ridge regression path",
- "text": "Ridge regression path\n\nY <- prostate$lpsa\nX <- model.matrix(~ ., data = prostate |> dplyr::select(-train, -lpsa))\nlibrary(glmnet)\nridge <- glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)\n\n\n\n\n\nplot(ridge, xvar = \"lambda\", lwd = 3)\n\n\n\n\n\n\n\n\n\n\nModel selection here:\n\nmeans choose some \\(\\lambda\\)\nA value of \\(\\lambda\\) is a vertical line.\nThis graphic is a “path” or “coefficient trace”\nCoefficients for varying \\(\\lambda\\)"
+ "section": "11 Local methods",
+ "text": "11 Local methods\nStat 406\nDaniel J. McDonald\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#solving-the-minimization",
- "href": "schedule/slides/08-ridge-regression.html#solving-the-minimization",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#last-time",
+ "href": "schedule/slides/11-kernel-smoothers.html#last-time",
"title": "UBC Stat406 2023W",
- "section": "Solving the minimization",
- "text": "Solving the minimization\n\nOne nice thing about ridge regression is that it has a closed-form solution (like OLS)\n\n\\[\\brl = (\\X^\\top\\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y\\]\n\nThis is easy to calculate in R for any \\(\\lambda\\).\nHowever, computations and interpretation are simplified if we examine the Singular Value Decomposition of \\(\\X = \\mathbf{UDV}^\\top\\).\nRecall: any matrix has an SVD.\nHere \\(\\mathbf{D}\\) is diagonal and \\(\\mathbf{U}\\) and \\(\\mathbf{V}\\) are orthonormal: \\(\\mathbf{U}^\\top\\mathbf{U} = \\mathbf{I}\\)."
+ "section": "Last time…",
+ "text": "Last time…\nWe looked at feature maps as a way to do nonlinear regression.\nWe used new “features” \\(\\Phi(x) = \\bigg(\\phi_1(x),\\ \\phi_2(x),\\ldots,\\phi_k(x)\\bigg)\\)\nNow we examine an alternative\nSuppose I just look at the “neighbours” of some point (based on the \\(x\\)-values)\nI just average the \\(y\\)’s at those locations together"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#solving-the-minization",
- "href": "schedule/slides/08-ridge-regression.html#solving-the-minization",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#lets-use-3-neighbours",
+ "href": "schedule/slides/11-kernel-smoothers.html#lets-use-3-neighbours",
"title": "UBC Stat406 2023W",
- "section": "Solving the minization",
- "text": "Solving the minization\n\\[\\brl = (\\X^\\top\\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y\\]\n\nNote that \\(\\mathbf{X}^\\top\\mathbf{X} = \\mathbf{VDU}^\\top\\mathbf{UDV}^\\top = \\mathbf{V}\\mathbf{D}^2\\mathbf{V}^\\top\\).\nThen,\n\n\\[\\brl = (\\X^\\top \\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y = (\\mathbf{VD}^2\\mathbf{V}^\\top + \\lambda \\mathbf{I})^{-1}\\mathbf{VDU}^\\top \\y\n= \\mathbf{V}(\\mathbf{D}^2+\\lambda \\mathbf{I})^{-1} \\mathbf{DU}^\\top \\y.\\]\n\nFor computations, now we only need to invert \\(\\mathbf{D}\\)."
+ "section": "Let’s use 3 neighbours",
+ "text": "Let’s use 3 neighbours\n\n\nCode\nlibrary(cowplot)\ndata(arcuate, package = \"Stat406\")\nset.seed(406406)\narcuate_unif <- arcuate |> slice_sample(n = 40) |> arrange(position)\npt <- 15\nnn <- 3\nseq_range <- function(x, n = 101) seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n)\nneibs <- sort.int(abs(arcuate_unif$position - arcuate_unif$position[pt]), index.return = TRUE)$ix[1:nn]\narcuate_unif$neighbours = seq_len(40) %in% neibs\ng1 <- ggplot(arcuate_unif, aes(position, fa, colour = neighbours)) + \n geom_point() +\n scale_colour_manual(values = c(blue, red)) + \n geom_vline(xintercept = arcuate_unif$position[pt], colour = red) + \n annotate(\"rect\", fill = red, alpha = .25, ymin = -Inf, ymax = Inf,\n xmin = min(arcuate_unif$position[neibs]), \n xmax = max(arcuate_unif$position[neibs])\n ) +\n theme(legend.position = \"none\")\ng2 <- ggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(\n data = tibble(\n position = seq_range(arcuate_unif$position),\n fa = FNN::knn.reg(\n arcuate_unif$position, matrix(position, ncol = 1),\n y = arcuate_unif$fa\n )$pred\n ),\n colour = orange, linewidth = 2\n )\nplot_grid(g1, g2, ncol = 2)"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#comparing-with-ols",
- "href": "schedule/slides/08-ridge-regression.html#comparing-with-ols",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#knn",
+ "href": "schedule/slides/11-kernel-smoothers.html#knn",
"title": "UBC Stat406 2023W",
- "section": "Comparing with OLS",
- "text": "Comparing with OLS\n\n\\(\\mathbf{D}\\) is a diagonal matrix\n\n\\[\\bls = (\\X^\\top\\X)^{-1}\\X^\\top \\y = (\\mathbf{VD}^2\\mathbf{V}^\\top)^{-1}\\mathbf{VDU}^\\top \\y = \\mathbf{V}\\color{red}{\\mathbf{D}^{-2}\\mathbf{D}}\\mathbf{U}^\\top \\y = \\mathbf{V}\\color{red}{\\mathbf{D}^{-1}}\\mathbf{U}^\\top \\y\\]\n\\[\\brl = (\\X^\\top \\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y = \\mathbf{V}\\color{red}{(\\mathbf{D}^2+\\lambda \\mathbf{I})^{-1}} \\mathbf{DU}^\\top \\y.\\]\n\nNotice that \\(\\bls\\) depends on \\(d_j/d_j^2\\) while \\(\\brl\\) depends on \\(d_j/(d_j^2 + \\lambda)\\).\nRidge regression makes the coefficients smaller relative to OLS.\nBut if \\(\\X\\) has small singular values, ridge regression compensates with \\(\\lambda\\) in the denominator."
+ "section": "KNN",
+ "text": "KNN\n\n\n\n\n\n\n\n\n\n\n\n\n\n\ndata(arcuate, package = \"Stat406\")\nlibrary(FNN)\narcuate_unif <- arcuate |> \n slice_sample(n = 40) |> \n arrange(position) \n\nnew_position <- seq(\n min(arcuate_unif$position), \n max(arcuate_unif$position),\n length.out = 101\n)\n\nknn3 <- knn.reg(\n train = arcuate_unif$position, \n test = matrix(arcuate_unif$position, ncol = 1), \n y = arcuate_unif$fa, \n k = 3\n)"
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression-and-multicollinearity",
- "href": "schedule/slides/08-ridge-regression.html#ridge-regression-and-multicollinearity",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#this-method-is-k-nearest-neighbours.",
+ "href": "schedule/slides/11-kernel-smoothers.html#this-method-is-k-nearest-neighbours.",
"title": "UBC Stat406 2023W",
- "section": "Ridge regression and multicollinearity",
- "text": "Ridge regression and multicollinearity\nMulticollinearity: a linear combination of predictor variables is nearly equal to another predictor variable.\nSome comments:\n\nA better phrase: \\(\\X\\) is ill-conditioned\nAKA “(numerically) rank-deficient”.\n\\(\\X = \\mathbf{U D V}^\\top\\) ill-conditioned \\(\\Longleftrightarrow\\) some elements of \\(\\mathbf{D} \\approx 0\\)\n\\(\\bls= \\mathbf{V D}^{-1} \\mathbf{U}^\\top \\y\\), so small entries of \\(\\mathbf{D}\\) \\(\\Longleftrightarrow\\) huge elements of \\(\\mathbf{D}^{-1}\\)\nMeans huge variance: \\(\\Var{\\bls} = \\sigma^2(\\X^\\top \\X)^{-1} = \\sigma^2 \\mathbf{V D}^{-2} \\mathbf{V}^\\top\\)"
+ "section": "This method is \\(K\\)-nearest neighbours.",
+ "text": "This method is \\(K\\)-nearest neighbours.\nIt’s a linear smoother just like in previous lectures: \\(\\widehat{\\mathbf{y}} = \\mathbf{S} \\mathbf{y}\\) for some matrix \\(S\\).\nYou should imagine what \\(\\mathbf{S}\\) looks like.\nWhat is the degrees of freedom of KNN?\nKNN averages the neighbours with equal weight.\nBut some neighbours are “closer” than other neighbours."
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression-and-ill-posed-x",
- "href": "schedule/slides/08-ridge-regression.html#ridge-regression-and-ill-posed-x",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#local-averages",
+ "href": "schedule/slides/11-kernel-smoothers.html#local-averages",
"title": "UBC Stat406 2023W",
- "section": "Ridge regression and ill-posed \\(\\X\\)",
- "text": "Ridge regression and ill-posed \\(\\X\\)\nRidge Regression fixes this problem by preventing the division by a near-zero number\n\nConclusion\n\n\\((\\X^{\\top}\\X)^{-1}\\) can be really unstable, while \\((\\X^{\\top}\\X + \\lambda \\mathbf{I})^{-1}\\) is not.\n\nAside\n\nEngineering approach to solving linear systems is to always do this with small \\(\\lambda\\). The thinking is about the numerics rather than the statistics.\n\n\nWhich \\(\\lambda\\) to use?\n\nComputational\n\nUse CV and pick the \\(\\lambda\\) that makes this smallest.\n\nIntuition (bias)\n\nAs \\(\\lambda\\rightarrow\\infty\\), bias ⬆\n\nIntuition (variance)\n\nAs \\(\\lambda\\rightarrow\\infty\\), variance ⬇\n\n\nYou should think about why."
+ "section": "Local averages",
+ "text": "Local averages\nInstead of choosing the number of neighbours to average, we can average any observations within a certain distance.\n\n\nThe boxes have width 30."
},
{
- "objectID": "schedule/slides/08-ridge-regression.html#can-we-get-the-best-of-both-worlds",
- "href": "schedule/slides/08-ridge-regression.html#can-we-get-the-best-of-both-worlds",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#what-is-a-kernel-smoother",
+ "href": "schedule/slides/11-kernel-smoothers.html#what-is-a-kernel-smoother",
"title": "UBC Stat406 2023W",
- "section": "Can we get the best of both worlds?",
- "text": "Can we get the best of both worlds?\nTo recap:\n\nDeciding which predictors to include, adding quadratic terms, or interactions is model selection (more precisely variable selection within a linear model).\nRidge regression provides regularization, which trades off bias and variance and also stabilizes multicollinearity.\nIf the LM is true,\n\nOLS is unbiased, but Variance depends on \\(\\mathbf{D}^{-2}\\). Can be big.\nRidge is biased (can you find the bias?). But Variance is smaller than OLS.\n\nRidge regression does not perform variable selection.\nBut picking \\(\\lambda=3.7\\) and thereby deciding to predict with \\(\\widehat{\\beta}^R_{3.7}\\) is model selection."
+ "section": "What is a “kernel” smoother?",
+ "text": "What is a “kernel” smoother?\n\nThe mathematics:\n\n\nA kernel is any function \\(K\\) such that for any \\(u\\), \\(K(u) \\geq 0\\), \\(\\int du K(u)=1\\) and \\(\\int uK(u)du=0\\).\n\n\nThe idea: a kernel is a nice way to take weighted averages. The kernel function gives the weights.\nThe previous example is called the boxcar kernel."
},
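A quick numerical check of the definition, assuming the Gaussian density and the Epanechnikov function (which appears on a later slide) as the kernels; `integrate()` is base R.

```r
K <- dnorm # Gaussian kernel: K(u) >= 0 for all u
integrate(K, -Inf, Inf)$value                    # ~ 1: integral of K(u) du
integrate(function(u) u * K(u), -Inf, Inf)$value # ~ 0: integral of u K(u) du

epan <- function(x) 3 / 4 * (1 - x^2) * (abs(x) < 1)
integrate(epan, -1, 1)$value                     # also ~ 1
```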
{
- "objectID": "schedule/slides/08-ridge-regression.html#can-we-get-the-best-of-both-worlds-1",
- "href": "schedule/slides/08-ridge-regression.html#can-we-get-the-best-of-both-worlds-1",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#smoothing-with-the-boxcar",
+ "href": "schedule/slides/11-kernel-smoothers.html#smoothing-with-the-boxcar",
"title": "UBC Stat406 2023W",
- "section": "Can we get the best of both worlds?",
- "text": "Can we get the best of both worlds?\n\nRidge regression\n\n\\(\\minimize \\frac{1}{n}||\\y-\\X\\beta||_2^2 \\ \\st\\ ||\\beta||_2^2 \\leq s\\)\n\nBest (in-sample) linear regression model of size \\(s\\)\n\n\\(\\minimize \\frac{1}{n}||\\y-\\X\\beta||_2^2 \\ \\st\\ ||\\beta||_0 \\leq s\\)\n\n\n\\(||\\beta||_0\\) is the number of nonzero elements in \\(\\beta\\)\nFinding the best in-sample linear model (of size \\(s\\), among these predictors) is a nonconvex optimization problem (In fact, it is NP-hard)\nRidge regression is convex (easy to solve), but doesn’t do variable selection\nCan we somehow “interpolate” to get both?\nNote: selecting \\(\\lambda\\) is still model selection, but we’ve included all the variables."
+ "section": "Smoothing with the boxcar",
+ "text": "Smoothing with the boxcar\n\n\nCode\ntestpts <- seq(0, 200, length.out = 101)\ndmat <- abs(outer(testpts, arcuate_unif$position, \"-\"))\nS <- (dmat < 15)\nS <- S / rowSums(S)\nboxcar <- tibble(position = testpts, fa = S %*% arcuate_unif$fa)\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(data = boxcar, colour = orange)\n\n\n\nThis one gives the same non-zero weight to all points within \\(\\pm 15\\) range."
},
{
- "objectID": "schedule/slides/06-information-criteria.html#meta-lecture",
- "href": "schedule/slides/06-information-criteria.html#meta-lecture",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#other-kernels",
+ "href": "schedule/slides/11-kernel-smoothers.html#other-kernels",
"title": "UBC Stat406 2023W",
- "section": "06 Information criteria",
- "text": "06 Information criteria\nStat 406\nDaniel J. McDonald\nLast modified – 26 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Other kernels",
+ "text": "Other kernels\nMost of the time, we don’t use the boxcar because the weights are weird. (constant)\nA more common one is the Gaussian kernel:\n\n\nCode\ngaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 7.5) * 3\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +\n stat_function(fun = gaussian_kernel, geom = \"area\", fill = orange)\n\n\n\nFor the plot, I made \\(\\sigma=7.5\\).\nNow the weights “die away” for points farther from where we’re predicting. (but all nonzero!!)"
},
{
- "objectID": "schedule/slides/06-information-criteria.html#generalized-cv",
- "href": "schedule/slides/06-information-criteria.html#generalized-cv",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#other-kernels-1",
+ "href": "schedule/slides/11-kernel-smoothers.html#other-kernels-1",
"title": "UBC Stat406 2023W",
- "section": "Generalized CV",
- "text": "Generalized CV\nLast time we saw a nice trick, that works some of the time (OLS, Ridge regression,…)\n\\[\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(y_i -\\widehat{y}_i)^2}{(1-h_{ii})^2} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-h_{ii})^2}.\\]\n\n\\(\\widehat{\\y} = \\widehat{f}(\\mathbf{X}) = \\mathbf{H}\\mathbf{y}\\) for some matrix \\(\\mathbf{H}\\).\nA technical thing.\n\n\\[\\newcommand{\\H}{\\mathbf{H}}\\]"
+ "section": "Other kernels",
+ "text": "Other kernels\nWhat if I made \\(\\sigma=15\\)?\n\n\nCode\ngaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 15) * 3\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +\n stat_function(fun = gaussian_kernel, geom = \"area\", fill = orange)\n\n\n\nBefore, points far from \\(x_{15}\\) got very small weights, now they have more influence.\nFor the Gaussian kernel, \\(\\sigma\\) determines something like the “range” of the smoother."
},
{
- "objectID": "schedule/slides/06-information-criteria.html#this-is-another-nice-trick.",
- "href": "schedule/slides/06-information-criteria.html#this-is-another-nice-trick.",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#many-gaussians",
+ "href": "schedule/slides/11-kernel-smoothers.html#many-gaussians",
"title": "UBC Stat406 2023W",
- "section": "This is another nice trick.",
- "text": "This is another nice trick.\nIdea: replace \\(h_{ii}\\) with \\(\\frac{1}{n}\\sum_{i=1}^n h_{ii} = \\frac{1}{n}\\textrm{tr}(\\mathbf{H})\\)\nLet’s call \\(\\textrm{tr}(\\mathbf{H})\\) the degrees-of-freedom (or just df)\n\\[\\textrm{GCV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-\\textrm{df}/n)^2} = \\frac{\\textrm{MSE}}{(1-\\textrm{df}/n)^2}\\]\nWhere does this stuff come from?"
+ "section": "Many Gaussians",
+ "text": "Many Gaussians\nThe following code creates \\(\\mathbf{S}\\) for Gaussian kernel smoothers with different \\(\\sigma\\)\n\ndmat <- as.matrix(dist(x))\nSgauss <- function(sigma) {\n gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat\n sweep(gg, 1, rowSums(gg), \"/\") # make the rows sum to 1.\n}\n\n\n\nCode\nSgauss <- function(sigma) {\n gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat\n sweep(gg, 1, rowSums(gg),'/') # make the rows sum to 1.\n}\nboxcar$S15 = with(arcuate_unif, Sgauss(15) %*% fa)\nboxcar$S08 = with(arcuate_unif, Sgauss(8) %*% fa)\nboxcar$S30 = with(arcuate_unif, Sgauss(30) %*% fa)\nbc = boxcar %>% select(position, S15, S08, S30) %>% \n pivot_longer(-position, names_to = \"Sigma\")\nggplot(arcuate_unif, aes(position, fa)) + \n geom_point(colour = blue) + \n geom_line(data = bc, aes(position, value, colour = Sigma), linewidth = 1.5) +\n scale_colour_brewer(palette = \"Set1\")"
},
{
- "objectID": "schedule/slides/06-information-criteria.html#what-are-hatvalues",
- "href": "schedule/slides/06-information-criteria.html#what-are-hatvalues",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#the-bandwidth",
+ "href": "schedule/slides/11-kernel-smoothers.html#the-bandwidth",
"title": "UBC Stat406 2023W",
- "section": "What are hatvalues?",
- "text": "What are hatvalues?\n\ncv_nice <- function(mdl) mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)\n\nIn OLS, \\(\\widehat{\\y} = \\X\\widehat{\\beta} = \\X(\\X^\\top \\X)^{-1}\\X^\\top \\y\\)\nWe often call \\(\\mathbf{H} = \\X(\\X^\\top \\X)^{-1}\\X^\\top\\) the Hat matrix, because it puts the hat on \\(\\y\\)\nGCV uses \\(\\textrm{tr}(\\mathbf{H})\\).\nFor lm(), this is just p, the number of predictors (Why?)\nThis is one way of understanding the name degrees-of-freedom"
+ "section": "The bandwidth",
+ "text": "The bandwidth\n\nChoosing \\(\\sigma\\) is very important.\nThis “range” parameter is called the bandwidth.\nIt is way more important than which kernel you use.\nThe default kernel in ksmooth() is something called ‘Epanechnikov’:\n\n\nepan <- function(x) 3/4 * (1 - x^2) * (abs(x) < 1)\nggplot(data.frame(x = c(-2, 2)), aes(x)) + stat_function(fun = epan, colour = green, linewidth = 2)"
},
{
- "objectID": "schedule/slides/06-information-criteria.html#alternative-interpretation",
- "href": "schedule/slides/06-information-criteria.html#alternative-interpretation",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#choosing-the-bandwidth",
+ "href": "schedule/slides/11-kernel-smoothers.html#choosing-the-bandwidth",
"title": "UBC Stat406 2023W",
- "section": "Alternative interpretation:",
- "text": "Alternative interpretation:\nSuppose, \\(Y_i\\) is independent from some distribution with mean \\(\\mu_i\\) and variance \\(\\sigma^2\\)\n(remember: in the linear model \\(\\Expect{Y_i} = x_i^\\top \\beta = \\mu_i\\) )\nLet \\(\\widehat{\\mathbf{Y}}\\) be an estimator of \\(\\mu\\) (all \\(i=1,\\ldots,n\\) elements of the vector).\n\n\\[\\begin{aligned}\n& \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} \\\\\n&= \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-Y_i + Y_i -\\mu_i)^2}\\\\\n&= \\frac{1}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)^2} + \\frac{1}{n}\\Expect{\\sum (Y_i-\\mu_i)^2} + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)}\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} + \\sigma^2 + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)} = \\cdots =\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}\n\\end{aligned}\\]"
+ "section": "Choosing the bandwidth",
+ "text": "Choosing the bandwidth\nAs we have discussed, kernel smoothing (and KNN) are linear smoothers\n\\[\\widehat{\\mathbf{y}} = \\mathbf{S}\\mathbf{y}\\]\nThe degrees of freedom is \\(\\textrm{tr}(\\mathbf{S})\\)\nTherefore we can use our model selection criteria from before\n\nUnfortunately, these don’t satisfy the “technical condition”, so cv_nice() doesn’t give LOO-CV"
},
{
- "objectID": "schedule/slides/06-information-criteria.html#alternative-interpretation-1",
- "href": "schedule/slides/06-information-criteria.html#alternative-interpretation-1",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#smoothing-the-full-lidar-data",
+ "href": "schedule/slides/11-kernel-smoothers.html#smoothing-the-full-lidar-data",
"title": "UBC Stat406 2023W",
- "section": "Alternative interpretation:",
- "text": "Alternative interpretation:\n\\[\\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} = \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}\\]\nNow, if \\(\\widehat{\\mathbf{Y}} = \\H \\mathbf{Y}\\) for some matrix \\(\\H\\),\n\\(\\sum\\Cov{Y_i}{\\widehat Y_i} = \\Expect{\\mathbf{Y}^\\top \\H \\mathbf{Y}} = \\sigma^2 \\textrm{tr}(\\H)\\)\nThis gives Mallow’s \\(C_p\\) aka Stein’s Unbiased Risk Estimator:\n\\(MSE + 2\\hat{\\sigma}^2\\textrm{df}/n\\)\n\n\n\n\n\n\nImportant\n\n\nUnfortunately, df may be difficult or impossible to calculate for complicated prediction methods. But one can often estimate it well. This idea is beyond the level of this course."
+ "section": "Smoothing the full Lidar data",
+ "text": "Smoothing the full Lidar data\n\nar <- arcuate |> slice_sample(n = 200)\n\ngcv <- function(y, S) {\n yhat <- S %*% y\n mean( (y - yhat)^2 / (1 - mean(diag(S)))^2 )\n}\n\nfake_loocv <- function(y, S) {\n yhat <- S %*% y\n mean( (y - yhat)^2 / (1 - diag(S))^2 )\n}\n\ndmat <- as.matrix(dist(ar$position))\nsigmas <- 10^(seq(log10(300), log10(.3), length = 100))\n\ngcvs <- map_dbl(sigmas, ~ gcv(ar$fa, Sgauss(.x)))\nflcvs <- map_dbl(sigmas, ~ fake_loocv(ar$fa, Sgauss(.x)))\nbest_s <- sigmas[which.min(gcvs)]\nother_s <- sigmas[which.min(flcvs)]\n\nar$smoothed <- Sgauss(best_s) %*% ar$fa\nar$other <- Sgauss(other_s) %*% ar$fa"
},
{
- "objectID": "schedule/slides/06-information-criteria.html#aic-and-bic",
- "href": "schedule/slides/06-information-criteria.html#aic-and-bic",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#smoothing-the-full-lidar-data-1",
+ "href": "schedule/slides/11-kernel-smoothers.html#smoothing-the-full-lidar-data-1",
"title": "UBC Stat406 2023W",
- "section": "AIC and BIC",
- "text": "AIC and BIC\nThese have a very similar flavor to \\(C_p\\), but their genesis is different.\nWithout going into too much detail, they look like\n\\(\\textrm{AIC}/n = -2\\textrm{loglikelihood}/n + 2\\textrm{df}/n\\)\n\\(\\textrm{BIC}/n = -2\\textrm{loglikelihood}/n + 2\\log(n)\\textrm{df}/n\\)\n\nIn the case of a linear model with Gaussian errors and \\(p\\) predictors\n\\[\\begin{aligned}\n\\textrm{AIC}/n &= \\log(2\\pi) + \\log(RSS/n) + 2(p+1)/n \\\\\n&\\propto \\log(RSS) + 2(p+1)/n\n\\end{aligned}\\]\n( \\(p+1\\) because of the unknown variance, intercept included in \\(p\\) or not)\n\n\n\n\n\n\n\n\nImportant\n\n\nUnfortunately, different books/software/notes define these differently. Even different R packages. This is super annoying.\nForms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here."
+ "section": "Smoothing the full Lidar data",
+ "text": "Smoothing the full Lidar data\n\n\nCode\ng3 <- ggplot(data.frame(sigma = sigmas, gcv = gcvs), aes(sigma, gcv)) +\n geom_point(colour = blue) +\n geom_vline(xintercept = best_s, colour = red) +\n scale_x_log10() +\n xlab(sprintf(\"Sigma, best is sig = %.2f\", best_s))\ng4 <- ggplot(ar, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(aes(y = smoothed), colour = orange, linewidth = 2)\nplot_grid(g3, g4, nrow = 1)\n\n\n\nI considered \\(\\sigma \\in [0.3,\\ 300]\\) and used \\(3.97\\).\nIt’s too wiggly, to my eye. Typical for GCV."
},
{
- "objectID": "schedule/slides/06-information-criteria.html#over-fitting-vs.-under-fitting",
- "href": "schedule/slides/06-information-criteria.html#over-fitting-vs.-under-fitting",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#smoothing-manually",
+ "href": "schedule/slides/11-kernel-smoothers.html#smoothing-manually",
"title": "UBC Stat406 2023W",
- "section": "Over-fitting vs. Under-fitting",
- "text": "Over-fitting vs. Under-fitting\n\nOver-fitting means estimating a really complicated function when you don’t have enough data.\n\nThis is likely a low-bias / high-variance situation.\n\nUnder-fitting means estimating a really simple function when you have lots of data.\n\nThis is likely a high-bias / low-variance situation.\nBoth of these outcomes are bad (they have high risk \\(=\\) big \\(R_n\\) ).\nThe best way to avoid them is to use a reasonable estimate of prediction risk to choose how complicated your model should be."
+ "section": "Smoothing manually",
+ "text": "Smoothing manually\nI did Kernel Smoothing “manually”\n\nFor a fixed bandwidth\nCompute the smoothing matrix\nMake the predictions\nRepeat and compute GCV\n\nThe point is to “show how it works”. It’s also really easy."
},
{
- "objectID": "schedule/slides/06-information-criteria.html#recommendations",
- "href": "schedule/slides/06-information-criteria.html#recommendations",
+ "objectID": "schedule/slides/11-kernel-smoothers.html#r-functions-packages",
+ "href": "schedule/slides/11-kernel-smoothers.html#r-functions-packages",
"title": "UBC Stat406 2023W",
- "section": "Recommendations",
- "text": "Recommendations\n\nWhen comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV.\nCV is usually easiest to make sense of and doesn’t depend on other unknown parameters.\nBut, it requires refitting the model.\nAlso, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc."
+ "section": "R functions / packages",
+ "text": "R functions / packages\nThere are a number of other ways to do this in R\n\nloess()\nksmooth()\nKernSmooth::locpoly()\nmgcv::gam()\nnp::npreg()\n\nThese have tricks and ways of doing CV and other things automatically.\n\nNote\n\nAll I needed was the distance matrix dist(x).\n\n\nGiven ANY distance function\n\n\nsay, \\(d(\\mathbf{x}_i, \\mathbf{x}_j) = \\Vert\\mathbf{x}_i - \\mathbf{x}_j\\Vert_2 + I(x_{i,3} = x_{j,3})\\)\n\n\nI can use these methods."
},
{
- "objectID": "schedule/slides/06-information-criteria.html#high-level-intuition-of-these",
- "href": "schedule/slides/06-information-criteria.html#high-level-intuition-of-these",
+ "objectID": "schedule/slides/09-l1-penalties.html#meta-lecture",
+ "href": "schedule/slides/09-l1-penalties.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "High-level intuition of these:",
- "text": "High-level intuition of these:\n\nGCV tends to choose “dense” models.\nTheory says AIC chooses the “best predicting model” asymptotically.\nTheory says BIC should choose the “true model” asymptotically, tends to select fewer predictors.\nIn some special cases, AIC = Cp = SURE \\(\\approx\\) LOO-CV\nAs a technical point, CV (or validation set) is estimating error on new data, unseen \\((X_0, Y_0)\\), while AIC / CP are estimating error on new Y at the observed \\(x_1,\\ldots,x_n\\). This is subtle.\n\n\n\nFor more information: see [ESL] Chapter 7. This material is more challenging than the level of this course, and is easily and often misunderstood."
+ "section": "09 L1 penalties",
+ "text": "09 L1 penalties\nStat 406\nDaniel J. McDonald\nLast modified – 02 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/06-information-criteria.html#a-few-more-caveats",
- "href": "schedule/slides/06-information-criteria.html#a-few-more-caveats",
+ "objectID": "schedule/slides/09-l1-penalties.html#last-time",
+ "href": "schedule/slides/09-l1-penalties.html#last-time",
"title": "UBC Stat406 2023W",
- "section": "A few more caveats",
- "text": "A few more caveats\nIt is often tempting to “just compare” risk estimates from vastly different models.\nFor example,\n\ndifferent transformations of the predictors,\ndifferent transformations of the response,\nPoisson likelihood vs. Gaussian likelihood in glm()\n\nThis is not always justified.\n\nThe “high-level intuition” is for “nested” models.\nDifferent likelihoods aren’t comparable.\nResiduals / response variables on different scales aren’t directly comparable.\n\n“Validation set” is easy, because you’re always comparing to the “right” thing. But it has lots of drawbacks."
+ "section": "Last time",
+ "text": "Last time\n\nRidge regression\n\n\\(\\min \\frac{1}{n}\\snorm{\\y-\\X\\beta}_2^2 \\st \\snorm{\\beta}_2^2 \\leq s\\)\n\nBest (in sample) linear regression model of size \\(s\\)\n\n\\(\\min \\frac 1n \\snorm{\\y-\\X\\beta}_2^2 \\st \\snorm{\\beta}_0 \\leq s\\)\n\n\n\\(\\snorm{\\beta}_0\\) is the number of nonzero elements in \\(\\beta\\)\nFinding the “best” linear model (of size \\(s\\), among these predictors, in sample) is a nonconvex optimization problem (In fact, it is NP-hard)\nRidge regression is convex (easy to solve), but doesn’t do variable selection\nCan we somehow “interpolate” to get both?"
},
{
- "objectID": "schedule/slides/04-bias-variance.html#meta-lecture",
- "href": "schedule/slides/04-bias-variance.html#meta-lecture",
+ "objectID": "schedule/slides/09-l1-penalties.html#geometry-of-convexity",
+ "href": "schedule/slides/09-l1-penalties.html#geometry-of-convexity",
"title": "UBC Stat406 2023W",
- "section": "04 Bias and variance",
- "text": "04 Bias and variance\nStat 406\nDaniel J. McDonald\nLast modified – 18 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
+ "section": "Geometry of convexity",
+ "text": "Geometry of convexity\n\n\nCode\nlibrary(mvtnorm)\nnormBall <- function(q = 1, len = 1000) {\n tg <- seq(0, 2 * pi, length = len)\n out <- data.frame(x = cos(tg)) %>%\n mutate(b = (1 - abs(x)^q)^(1 / q), bm = -b) %>%\n gather(key = \"lab\", value = \"y\", -x)\n out$lab <- paste0('\"||\" * beta * \"||\"', \"[\", signif(q, 2), \"]\")\n return(out)\n}\n\nellipseData <- function(n = 100, xlim = c(-2, 3), ylim = c(-2, 3),\n mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {\n df <- expand.grid(\n x = seq(xlim[1], xlim[2], length.out = n),\n y = seq(ylim[1], ylim[2], length.out = n)\n )\n df$z <- dmvnorm(df, mean, Sigma)\n df\n}\n\nlballmax <- function(ed, q = 1, tol = 1e-6) {\n ed <- filter(ed, x > 0, y > 0)\n for (i in 1:20) {\n ff <- abs((ed$x^q + ed$y^q)^(1 / q) - 1) < tol\n if (sum(ff) > 0) break\n tol <- 2 * tol\n }\n best <- ed[ff, ]\n best[which.max(best$z), ]\n}\n\nnbs <- list()\nnbs[[1]] <- normBall(0, 1)\nqs <- c(.5, .75, 1, 1.5, 2)\nfor (ii in 2:6) nbs[[ii]] <- normBall(qs[ii - 1])\nnbs <- bind_rows(nbs)\nnbs$lab <- factor(nbs$lab, levels = unique(nbs$lab))\nseg <- data.frame(\n lab = levels(nbs$lab)[1],\n x0 = c(-1, 0), x1 = c(1, 0), y0 = c(0, -1), y1 = c(0, 1)\n)\nlevels(seg$lab) <- levels(nbs$lab)\nggplot(nbs, aes(x, y)) +\n geom_path(size = 1.2) +\n facet_wrap(~lab, labeller = label_parsed) +\n geom_segment(data = seg, aes(x = x0, xend = x1, y = y0, yend = y1), size = 1.2) +\n theme_bw(base_family = \"\", base_size = 24) +\n coord_equal() +\n scale_x_continuous(breaks = c(-1, 0, 1)) +\n scale_y_continuous(breaks = c(-1, 0, 1)) +\n geom_vline(xintercept = 0, size = .5) +\n geom_hline(yintercept = 0, size = .5) +\n xlab(bquote(beta[1])) +\n ylab(bquote(beta[2]))"
},
{
- "objectID": "schedule/slides/04-bias-variance.html#section",
- "href": "schedule/slides/04-bias-variance.html#section",
+ "objectID": "schedule/slides/09-l1-penalties.html#the-best-of-both-worlds",
+ "href": "schedule/slides/09-l1-penalties.html#the-best-of-both-worlds",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "We just talked about\n\nVariance of an estimator.\nIrreducible error when making predictions.\nThese are 2 of the 3 components of the “Prediction Risk” \\(R_n\\)"
+ "section": "The best of both worlds",
+ "text": "The best of both worlds\n\n\nCode\nnb <- normBall(1)\ned <- ellipseData()\nbols <- data.frame(x = 1, y = 1)\nbhat <- lballmax(ed, 1)\nggplot(nb, aes(x, y)) +\n geom_path(colour = red) +\n geom_contour(mapping = aes(z = z), colour = blue, data = ed, bins = 7) +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0) +\n geom_point(data = bols) +\n coord_equal(xlim = c(-2, 2), ylim = c(-2, 2)) +\n theme_bw(base_family = \"\", base_size = 24) +\n geom_label(\n data = bols, mapping = aes(label = bquote(\"hat(beta)[ols]\")), parse = TRUE,\n nudge_x = .3, nudge_y = .3\n ) +\n geom_point(data = bhat) +\n xlab(bquote(beta[1])) +\n ylab(bquote(beta[2])) +\n geom_label(\n data = bhat, mapping = aes(label = bquote(\"hat(beta)[s]^L\")), parse = TRUE,\n nudge_x = -.4, nudge_y = -.4\n )\n\n\n\nThis regularization set…\n\n… is convex (computationally efficient)\n… has corners (performs variable selection)"
},
{
- "objectID": "schedule/slides/04-bias-variance.html#component-3-the-bias",
- "href": "schedule/slides/04-bias-variance.html#component-3-the-bias",
+ "objectID": "schedule/slides/09-l1-penalties.html#ell_1-regularized-regression",
+ "href": "schedule/slides/09-l1-penalties.html#ell_1-regularized-regression",
"title": "UBC Stat406 2023W",
- "section": "Component 3, the Bias",
- "text": "Component 3, the Bias\nWe need to be specific about what we mean when we say bias.\nBias is neither good nor bad in and of itself.\nA very simple example: let \\(Z_1,\\ \\ldots,\\ Z_n \\sim N(\\mu, 1)\\). - We don’t know \\(\\mu\\), so we try to use the data (the \\(Z_i\\)’s) to estimate it.\n\nI propose 3 estimators:\n\n\\(\\widehat{\\mu}_1 = 12\\),\n\\(\\widehat{\\mu}_2=Z_6\\),\n\\(\\widehat{\\mu}_3=\\overline{Z}\\).\n\nThe bias (by definition) of my estimator is \\(E[\\widehat{\\mu_i}]-\\mu\\).\n\n\nCalculate the bias and variance of each estimator."
+ "section": "\\(\\ell_1\\)-regularized regression",
+ "text": "\\(\\ell_1\\)-regularized regression\nKnown as\n\n“lasso”\n“basis pursuit”\n\nThe estimator satisfies\n\\[\\blt = \\argmin_{ \\snorm{\\beta}_1 \\leq s} \\frac{1}{n}\\snorm{\\y-\\X\\beta}_2^2\\]\nIn its corresponding Lagrangian dual form:\n\\[\\bll = \\argmin_{\\beta} \\frac{1}{n}\\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_1\\]"
},
{
- "objectID": "schedule/slides/04-bias-variance.html#regression-in-general",
- "href": "schedule/slides/04-bias-variance.html#regression-in-general",
+ "objectID": "schedule/slides/09-l1-penalties.html#lasso",
+ "href": "schedule/slides/09-l1-penalties.html#lasso",
"title": "UBC Stat406 2023W",
- "section": "Regression in general",
- "text": "Regression in general\nIf I want to predict \\(Y\\) from \\(X\\), it is almost always the case that\n\\[\n\\mu(x) = \\Expect{Y\\given X=x} \\neq x^{\\top}\\beta\n\\]\nSo the bias of using a linear model is not zero.\n\nWhy? Because\n\\[\n\\Expect{Y\\given X=x}-x^\\top\\beta \\neq \\Expect{Y\\given X=x} - \\mu(x) = 0.\n\\]\nWe can include as many predictors as we like,\nbut this doesn’t change the fact that the world is non-linear."
+ "section": "Lasso",
+ "text": "Lasso\nWhile the ridge solution can be easily computed\n\\[\\brl = \\argmin_{\\beta} \\frac 1n \\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_2^2 = (\\X^{\\top}\\X + \\lambda \\mathbf{I})^{-1} \\X^{\\top}\\y\\]\nthe lasso solution\n\\[\\bll = \\argmin_{\\beta} \\frac 1n\\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_1 = \\; ??\\]\ndoesn’t have a closed-form solution.\nHowever, because the optimization problem is convex, there exist efficient algorithms for computing it\n\n\nThe best are Iterative Soft Thresholding or Coordinate Descent. Gradient Descent doesn’t work very well in practice."
},
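+For contrast with the lasso, the ridge formula above can be computed in one line; a minimal sketch (ignores the intercept, assumes centered and scaled X and y, and mirrors the slide’s formula rather than glmnet’s internal scaling):
+ridge_solve <- function(X, y, lambda) {
+  p <- ncol(X)
+  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y))) # (X'X + lambda I)^{-1} X'y
+}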
{
- "objectID": "schedule/slides/04-bias-variance.html#continuation-predicting-new-ys",
- "href": "schedule/slides/04-bias-variance.html#continuation-predicting-new-ys",
+ "objectID": "schedule/slides/09-l1-penalties.html#coefficient-path-ridge-vs-lasso",
+ "href": "schedule/slides/09-l1-penalties.html#coefficient-path-ridge-vs-lasso",
"title": "UBC Stat406 2023W",
- "section": "(Continuation) Predicting new Y’s",
- "text": "(Continuation) Predicting new Y’s\nSuppose we want to predict \\(Y\\),\nwe know \\(E[Y]= \\mu \\in \\mathbb{R}\\) and \\(\\textrm{Var}[Y] = 1\\).\nOur data is \\(\\{y_1,\\ldots,y_n\\}\\)\nWe have considered estimating \\(\\mu\\) in various ways, and using \\(\\widehat{Y} = \\widehat{\\mu}\\)\n\n\nLet’s try one more: \\(\\widehat Y_a = a\\overline{Y}_n\\) for some \\(a \\in (0,1]\\)."
+ "section": "Coefficient path: ridge vs lasso",
+ "text": "Coefficient path: ridge vs lasso\n\n\nCode\nlibrary(glmnet)\ndata(prostate, package = \"ElemStatLearn\")\nX <- prostate |> dplyr::select(-train, -lpsa) |> as.matrix()\nY <- prostate$lpsa\nlasso <- glmnet(x = X, y = Y) # alpha = 1 by default\nridge <- glmnet(x = X, y = Y, alpha = 0)\nop <- par()\n\n\n\npar(mfrow = c(1, 2), mar = c(5, 3, 5, .1))\nplot(lasso, main = \"Lasso\")\nplot(ridge, main = \"Ridge\")"
},
{
- "objectID": "schedule/slides/04-bias-variance.html#one-can-show-wait-for-the-proof",
- "href": "schedule/slides/04-bias-variance.html#one-can-show-wait-for-the-proof",
+ "objectID": "schedule/slides/09-l1-penalties.html#same-but-against-lambda",
+ "href": "schedule/slides/09-l1-penalties.html#same-but-against-lambda",
"title": "UBC Stat406 2023W",
- "section": "One can show… (wait for the proof)",
- "text": "One can show… (wait for the proof)\n\\(\\widehat Y_a = a\\overline{Y}_n\\) for some \\(a \\in (0,1]\\)\n\\[\nR_n(\\widehat Y_a) = \\Expect{(\\widehat Y_a-Y)^2} = (1 - a)^2\\mu^2 +\n\\frac{a^2}{n} +1\n\\]\n\nWe can minimize this in \\(a\\) to get the best possible prediction risk for an estimator of the form \\(\\widehat Y_a\\):\n\\[\n\\argmin_{a} R_n(\\widehat Y_a) = \\left(\\frac{\\mu^2}{\\mu^2 + 1/n} \\right)\n\\]\n\n\nWhat happens if \\(\\mu \\ll 1\\)?"
+ "section": "Same but against Lambda",
+ "text": "Same but against Lambda\n\npar(mfrow = c(1, 2), mar = c(5, 3, 5, .1))\nplot(lasso, main = \"Lasso\", xvar = \"lambda\")\nplot(ridge, main = \"Ridge\", xvar = \"lambda\")"
},
{
- "objectID": "schedule/slides/04-bias-variance.html#section-1",
- "href": "schedule/slides/04-bias-variance.html#section-1",
+ "objectID": "schedule/slides/09-l1-penalties.html#additional-intuition-for-why-lasso-selects-variables",
+ "href": "schedule/slides/09-l1-penalties.html#additional-intuition-for-why-lasso-selects-variables",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "Important\n\n\n\nWait a minute! I’m saying there is a better estimator than \\(\\overline{Y}_n\\)!"
+ "section": "Additional intuition for why Lasso selects variables",
+    "text": "Additional intuition for why Lasso selects variables\nSuppose, for a particular \\(\\lambda\\), I have solutions for \\(\\widehat{\\beta}_k\\), \\(k = 1,\\ldots,j-1, j+1,\\ldots,p\\).\nLet \\(\\widehat{\\y}_{-j} = \\X_{-j}\\widehat{\\beta}_{-j}\\), and assume WLOG \\(\\overline{\\X}_k = 0\\), \\(\\X_k^\\top\\X_k = 1\\ \\forall k\\)\nOne can show that:\n\\[\n\\widehat{\\beta}_j = S\\left(\\mathbf{X}^\\top_j(\\y - \\widehat{\\y}_{-j}),\\ \\lambda\\right).\n\\]\n\\[\nS(z, \\gamma) = \\textrm{sign}(z)(|z| - \\gamma)_+ = \\begin{cases} z - \\gamma & z > \\gamma\\\\\nz + \\gamma & z < -\\gamma \\\\ 0 & |z| \\leq \\gamma \\end{cases}\n\\]\n\nIterating over this update is called coordinate descent, and it converges to the lasso solution (a short sketch of the update appears below).\n\n\n\n\nIf I were told all the other coefficient estimates, then to find this one, I’d shrink when the gradient is big, or set it to 0 if it gets too small.\n\n\n\nSee, for example, https://doi.org/10.18637/jss.v033.i01"
},
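+A minimal R sketch of the soft-thresholding update above (hypothetical helper names; assumes y is centered and each column of X is scaled so X_j'X_j = 1, matching the slide, and omits the intercept and any convergence check):
+soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)
+lasso_cd <- function(X, y, lambda, maxit = 100) {
+  beta <- rep(0, ncol(X))
+  for (it in seq_len(maxit)) {
+    for (j in seq_len(ncol(X))) {
+      yhat_mj <- drop(X[, -j, drop = FALSE] %*% beta[-j]) # fitted values leaving out predictor j
+      beta[j] <- soft_threshold(sum(X[, j] * (y - yhat_mj)), lambda) # the update from the slide
+    }
+  }
+  beta
+}
+Looping until the coefficients stop changing gives the full coordinate-descent algorithm; glmnet does the same thing, much faster, in compiled code with warm starts along the lambda path.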
{
- "objectID": "schedule/slides/04-bias-variance.html#bias-variance-tradeoff-estimating-the-mean",
- "href": "schedule/slides/04-bias-variance.html#bias-variance-tradeoff-estimating-the-mean",
+ "objectID": "schedule/slides/09-l1-penalties.html#packages",
+ "href": "schedule/slides/09-l1-penalties.html#packages",
"title": "UBC Stat406 2023W",
- "section": "Bias-variance tradeoff: Estimating the mean",
- "text": "Bias-variance tradeoff: Estimating the mean\n\\[\nR_n(\\widehat Y_a) = (a - 1)^2\\mu^2 + \\frac{a^2}{n} + \\sigma^2\n\\]\n\nmu = 1; n = 5; sig = 1"
+ "section": "Packages",
+    "text": "Packages\nThere are two main R implementations for fitting the lasso\n{glmnet}: lasso = glmnet(X, Y, alpha=1).\n\nSetting alpha = 0 gives ridge regression (as does lm.ridge in the MASS package)\nSetting alpha \\(\\in (0,1)\\) gives a method called the “elastic net” which combines ridge regression and lasso; more on that next lecture.\nIf you don’t specify alpha, it does the lasso.\n\n{lars}: lars = lars(X, Y)\n\nlars() also does other things called “Least angle” and “forward stagewise” in addition to “forward stepwise” regression\nThe path returned by lars() is more useful than that returned by glmnet().\n\n\nBut you should use {glmnet}."
},
{
- "objectID": "schedule/slides/04-bias-variance.html#to-restate",
- "href": "schedule/slides/04-bias-variance.html#to-restate",
+ "objectID": "schedule/slides/09-l1-penalties.html#choosing-the-lambda",
+ "href": "schedule/slides/09-l1-penalties.html#choosing-the-lambda",
"title": "UBC Stat406 2023W",
- "section": "To restate",
- "text": "To restate\nIf \\(\\mu=\\) 1 and \\(n=\\) 5\nthen it is better to predict with 0.83 \\(\\overline{Y}_5\\)\nthan with \\(\\overline{Y}_5\\) itself.\n\nFor this \\(a =\\) 0.83 and \\(n=5\\)\n\n\\(R_5(\\widehat{Y}_a) =\\) 1.17\n\\(R_5(\\overline{Y}_5)=\\) 1.2"
+ "section": "Choosing the \\(\\lambda\\)",
+ "text": "Choosing the \\(\\lambda\\)\nYou have to choose \\(\\lambda\\) in lasso or in ridge regression\nlasso selects variables (by setting coefficients to zero), but the value of \\(\\lambda\\) determines how many/which.\nAll of these packages come with CV built in.\nHowever, the way to do it differs from package to package"
},
{
- "objectID": "schedule/slides/04-bias-variance.html#prediction-risk",
- "href": "schedule/slides/04-bias-variance.html#prediction-risk",
+ "objectID": "schedule/slides/09-l1-penalties.html#glmnet-version-same-procedure-for-lasso-or-ridge",
+ "href": "schedule/slides/09-l1-penalties.html#glmnet-version-same-procedure-for-lasso-or-ridge",
"title": "UBC Stat406 2023W",
- "section": "Prediction risk",
- "text": "Prediction risk\n(Now using generic prediction function \\(f\\))\n\\[\nR_n(f) = \\Expect{(Y - f(X))^2}\n\\]\nWhy should we care about \\(R_n(f)\\)?\n👍 Measures predictive accuracy on average.\n👍 How much confidence should you have in \\(f\\)’s predictions.\n👍 Compare with other predictors: \\(R_n(f)\\) vs \\(R_n(g)\\)\n🤮 This is hard: Don’t know the distribution of the data (if I knew the truth, this would be easy)"
+ "section": "{glmnet} version (same procedure for lasso or ridge)",
+    "text": "{glmnet} version (same procedure for lasso or ridge)\n\nlasso <- cv.glmnet(X, Y) # 1. Estimate the full model and CV at once; no good reason to call glmnet() itself\n# 2. Look at the CV curve. If the dashed lines are at the boundaries, redo and adjust lambda\nlambda_min <- lasso$lambda.min # the value, not the location (or use lasso$lambda.1se)\ncoeffs <- coefficients(lasso, s = \"lambda.min\") # s can be string or a number\npreds <- predict(lasso, newx = X, s = \"lambda.1se\") # must supply `newx`\n\n\n\\(\\widehat{R}_{CV}\\) is an estimator of \\(R_n\\); it has bias and variance.\nBecause we did CV, we actually have 10 \\(\\widehat{R}\\) values, 1 per split.\nCalculate the mean (that’s what we’ve been using), but what about the SE? (The fitted cv.glmnet object stores both; see the short check below.)"
},
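+Continuing the fit above, the CV error means and their standard errors (one of each per value of lambda) are already stored on the cv.glmnet object as cvm and cvsd, so one quick way to look at both is:
+cv_summary <- with(lasso, data.frame(lambda = lambda, mean_cv_error = cvm, se = cvsd))
+head(cv_summary) # cvm is the curve the CV plot draws; cvm +/- cvsd gives its error bars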
{
- "objectID": "schedule/slides/04-bias-variance.html#bias-variance-decomposition",
- "href": "schedule/slides/04-bias-variance.html#bias-variance-decomposition",
+ "objectID": "schedule/slides/09-l1-penalties.html#section",
+ "href": "schedule/slides/09-l1-penalties.html#section",
"title": "UBC Stat406 2023W",
- "section": "Bias-variance decomposition",
- "text": "Bias-variance decomposition\n\\[R_n(\\widehat{Y}_a)=(a - 1)^2\\mu^2 + \\frac{a^2}{n} + 1\\]\n\nprediction risk = \\(\\textrm{bias}^2\\) + variance + irreducible error\nestimation risk = \\(\\textrm{bias}^2\\) + variance\n\nWhat is \\(R_n(\\widehat{Y}_a)\\) for our estimator \\(\\widehat{Y}_a=a\\overline{Y}_n\\)?\n\\[\\begin{aligned}\n\\textrm{bias}(\\widehat{Y}_a) &= \\Expect{a\\overline{Y}_n} - \\mu=(a-1)\\mu\\\\\n\\textrm{var}(\\widehat f(x)) &= \\Expect{ \\left(a\\overline{Y}_n - \\Expect{a\\overline{Y}_n}\\right)^2}\n=a^2\\Expect{\\left(\\overline{Y}_n-\\mu\\right)^2}=\\frac{a^2}{n} \\\\\n\\sigma^2 &= \\Expect{(Y-\\mu)^2}=1\n\\end{aligned}\\]"
+ "section": "",
+ "text": "par(mfrow = c(1, 2), mar = c(5, 3, 3, 0))\nplot(lasso) # a plot method for the cv fit\nplot(lasso$glmnet.fit) # the glmnet.fit == glmnet(X,Y)\nabline(v = colSums(abs(coef(lasso$glmnet.fit)[-1, drop(lasso$index)])), lty = 2)"
},
{
- "objectID": "schedule/slides/04-bias-variance.html#this-decomposition-holds-generally",
- "href": "schedule/slides/04-bias-variance.html#this-decomposition-holds-generally",
+ "objectID": "schedule/slides/09-l1-penalties.html#paths-with-chosen-lambda",
+ "href": "schedule/slides/09-l1-penalties.html#paths-with-chosen-lambda",
"title": "UBC Stat406 2023W",
- "section": "This decomposition holds generally",
- "text": "This decomposition holds generally\n\\[\\begin{aligned}\nR_n(\\hat{Y})\n&= \\Expect{(Y-\\hat{Y})^2} \\\\\n&= \\Expect{(Y-\\mu + \\mu - \\hat{Y})^2} \\\\\n&= \\Expect{(Y-\\mu)^2} + \\Expect{(\\mu - \\hat{Y})^2} +\n2\\Expect{(Y-\\mu)(\\mu-\\hat{Y})}\\\\\n&= \\Expect{(Y-\\mu)^2} + \\Expect{(\\mu - \\hat{Y})^2} + 0\\\\\n&= \\text{irr. error} + \\text{estimation risk}\\\\\n&= \\sigma^2 + \\Expect{(\\mu - E[\\hat{Y}] + E[\\hat{Y}] - \\hat{Y})^2}\\\\\n&= \\sigma^2 + \\Expect{(\\mu - E[\\hat{Y}])^2} + \\Expect{(E[\\hat{Y}] - \\hat{Y})^2} + 2\\Expect{(\\mu-E[\\hat{Y}])(E[\\hat{Y}] - \\hat{Y})}\\\\\n&= \\sigma^2 + \\Expect{(\\mu - E[\\hat{Y}])^2} + \\Expect{(E[\\hat{Y}] - \\hat{Y})^2} + 0\\\\\n&= \\text{irr. error} + \\text{squared bias} + \\text{variance}\n\\end{aligned}\\]"
+ "section": "Paths with chosen lambda",
+ "text": "Paths with chosen lambda\n\nridge <- cv.glmnet(X, Y, alpha = 0, lambda.min.ratio = 1e-10) # added to get a minimum\npar(mfrow = c(1, 4))\nplot(ridge, main = \"Ridge\")\nplot(lasso, main = \"Lasso\")\nplot(ridge$glmnet.fit, main = \"Ridge\")\nabline(v = sum(abs(coef(ridge)))) # defaults to `lambda.1se`\nplot(lasso$glmnet.fit, main = \"Lasso\")\nabline(v = sum(abs(coef(lasso)))) # again, `lambda.1se` unless told otherwise"
},
{
- "objectID": "schedule/slides/04-bias-variance.html#bias-variance-decomposition-1",
- "href": "schedule/slides/04-bias-variance.html#bias-variance-decomposition-1",
+ "objectID": "schedule/slides/09-l1-penalties.html#degrees-of-freedom",
+ "href": "schedule/slides/09-l1-penalties.html#degrees-of-freedom",
"title": "UBC Stat406 2023W",
- "section": "Bias-variance decomposition",
- "text": "Bias-variance decomposition\n\\[\\begin{aligned}\nR_n(\\hat{Y})\n&= \\Expect{(Y-\\hat{Y})^2} \\\\\n&= \\text{irr. error} + \\text{estimation risk}\\\\\n&= \\text{irr. error} + \\text{squared bias} + \\text{variance}\n\\end{aligned}\\]\n\n\n\n\n\n\nImportant\n\n\n\nImplication: prediction risk is proportional to estimation risk. However, defining estimation risk requires stronger assumptions.\n\n\n\n\n\n\n\n\n\n\n\nTip\n\n\nIn order to make good predictions, we want our prediction risk to be small. This means that we want to “balance” the bias and variance."
+ "section": "Degrees of freedom",
+    "text": "Degrees of freedom\nLasso is not a linear smoother. There is no matrix \\(\\mathbf{S}\\) such that \\(\\widehat{\\y} = \\mathbf{S}\\y\\) for the predicted values from lasso.\n\nWe can’t use cv_nice().\nWe don’t have \\(\\tr{\\mathbf{S}} = \\textrm{df}\\) because there is no \\(\\mathbf{S}\\).\n\nHowever,\n\nOne can show that \\(\\textrm{df}_\\lambda = \\E[\\#(\\widehat{\\beta}_\\lambda \\neq 0)] = \\E[||\\widehat{\\beta}_\\lambda||_0]\\)\nThe proof is PhD-level material.\n\nNote that \\(\\widehat{\\textrm{df}}_\\lambda\\) is shown on the CV plot, and that lasso$glmnet.fit$df contains this value for all \\(\\lambda\\) (a short check appears below)."
},
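+Continuing the cv.glmnet fit from earlier, the estimated df is just the count of nonzero coefficients at each lambda, which glmnet also stores for you:
+fit <- lasso$glmnet.fit
+df_hat <- colSums(as.matrix(coef(fit))[-1, , drop = FALSE] != 0) # nonzero coefficients per lambda, intercept excluded
+all(df_hat == fit$df) # matches the counts stored in fit$df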
{
- "objectID": "schedule/slides/04-bias-variance.html#section-2",
- "href": "schedule/slides/04-bias-variance.html#section-2",
+ "objectID": "schedule/slides/09-l1-penalties.html#other-flavours",
+ "href": "schedule/slides/09-l1-penalties.html#other-flavours",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "Code\ncols = c(blue, red, green, orange)\npar(mfrow = c(2, 2), bty = \"n\", ann = FALSE, xaxt = \"n\", yaxt = \"n\", \n family = \"serif\", mar = c(0, 0, 0, 0), oma = c(0, 2, 2, 0))\nlibrary(mvtnorm)\nmv <- matrix(c(0, 0, 0, 0, -.5, -.5, -.5, -.5), 4, byrow = TRUE)\nva <- matrix(c(.02, .02, .1, .1, .02, .02, .1, .1), 4, byrow = TRUE)\n\nfor (i in 1:4) {\n plot(0, 0, ylim = c(-2, 2), xlim = c(-2, 2), pch = 19, cex = 42, \n col = blue, ann = FALSE, pty = \"s\")\n points(0, 0, pch = 19, cex = 30, col = \"white\")\n points(0, 0, pch = 19, cex = 18, col = green)\n points(0, 0, pch = 19, cex = 6, col = orange)\n points(rmvnorm(20, mean = mv[i, ], sigma = diag(va[i, ])), cex = 1, pch = 19)\n switch(i,\n \"1\" = {\n mtext(\"low variance\", 3, cex = 2)\n mtext(\"low bias\", 2, cex = 2)\n },\n \"2\" = mtext(\"high variance\", 3, cex = 2),\n \"3\" = mtext(\"high bias\", 2, cex = 2)\n )\n}"
+ "section": "Other flavours",
+ "text": "Other flavours\n\nThe elastic net\n\ngenerally used for correlated variables that combines a ridge/lasso penalty. Use glmnet(..., alpha = a) (0 < a < 1).\n\nGrouped lasso\n\nwhere variables are included or excluded in groups. Required for factors (1-hot encoding)\n\nRelaxed lasso\n\nTakes the estimated model from lasso and fits the full least squares solution on the selected covariates (less bias, more variance). Use glmnet(..., relax = TRUE).\n\nDantzig selector\n\na slightly modified version of the lasso"
},
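+To make the elastic net and relaxed lasso concrete with the prostate X and Y from the earlier slides, hypothetical fits might look like:
+enet <- glmnet(X, Y, alpha = 0.5) # elastic net: a 50/50 mix of the ridge and lasso penalties
+relaxed <- glmnet(X, Y, relax = TRUE) # relaxed lasso: refits least squares on each selected set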
{
- "objectID": "schedule/slides/04-bias-variance.html#bias-variance-tradeoff-overview",
- "href": "schedule/slides/04-bias-variance.html#bias-variance-tradeoff-overview",
+ "objectID": "schedule/slides/09-l1-penalties.html#lasso-cinematic-universe",
+ "href": "schedule/slides/09-l1-penalties.html#lasso-cinematic-universe",
"title": "UBC Stat406 2023W",
- "section": "Bias-variance tradeoff: Overview",
- "text": "Bias-variance tradeoff: Overview\nbias: how well does \\(\\widehat{f}(x)\\) approximate the truth \\(\\Expect{Y\\given X=x}\\)\n\nIf we allow more complicated possible \\(\\widehat{f}\\), lower bias. Flexibility \\(\\Rightarrow\\) Expressivity\nBut, more flexibility \\(\\Rightarrow\\) larger variance\nComplicated models are hard to estimate precisely for fixed \\(n\\)\nIrreducible error\n\n\n\nSadly, that whole exercise depends on knowing the truth to evaluate \\(E\\ldots\\)"
+ "section": "Lasso cinematic universe",
+ "text": "Lasso cinematic universe\n\n\n\nSCAD\n\na non-convex version of lasso that adds a more severe variable selection penalty\n\n\\(\\sqrt{\\textrm{lasso}}\\)\n\nclaims to be tuning parameter free (but isn’t). Uses \\(\\Vert\\cdot\\Vert_2\\) instead of \\(\\Vert\\cdot\\Vert_1\\) for the loss.\n\nGeneralized lasso\n\nAdds various additional matrices to the penalty term (e.g. \\(\\Vert D\\beta\\Vert_1\\)).\n\nArbitrary combinations\n\ncombine the above penalties in your favourite combinations"
},
{
- "objectID": "schedule/slides/02-lm-example.html#meta-lecture",
- "href": "schedule/slides/02-lm-example.html#meta-lecture",
+ "objectID": "schedule/slides/09-l1-penalties.html#warnings-on-regularized-regression",
+ "href": "schedule/slides/09-l1-penalties.html#warnings-on-regularized-regression",
"title": "UBC Stat406 2023W",
- "section": "02 Linear model example",
- "text": "02 Linear model example\nStat 406\nDaniel J. McDonald\nLast modified – 06 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\]"
+ "section": "Warnings on regularized regression",
+ "text": "Warnings on regularized regression\n\nThis isn’t a method unless you say how to choose \\(\\lambda\\).\nThe intercept is never penalized. Adds an extra degree-of-freedom.\nPredictor scaling is very important.\nDiscrete predictors need groupings.\nCentering the predictors is important\n(These all work with other likelihoods.)\n\n\nSoftware handles most of these automatically, but not always. (No Lasso with factor predictors.)"
},
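+Two of the warnings above are handled by glmnet’s defaults, which are worth spelling out rather than relying on silently:
+fit_std <- glmnet(X, Y, standardize = TRUE, intercept = TRUE) # predictors scaled internally; intercept unpenalized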
{
- "objectID": "schedule/slides/02-lm-example.html#economic-mobility",
- "href": "schedule/slides/02-lm-example.html#economic-mobility",
+ "objectID": "schedule/slides/07-greedy-selection.html#meta-lecture",
+ "href": "schedule/slides/07-greedy-selection.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Economic mobility",
- "text": "Economic mobility\n\ndata(\"mobility\", package = \"Stat406\")\nmobility\n\n# A tibble: 741 × 43\n ID Name Mobility State Population Urban Black Seg_racial Seg_income\n <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>\n 1 100 Johnson Ci… 0.0622 TN 576081 1 0.021 0.09 0.035\n 2 200 Morristown 0.0537 TN 227816 1 0.02 0.093 0.026\n 3 301 Middlesbor… 0.0726 TN 66708 0 0.015 0.064 0.024\n 4 302 Knoxville 0.0563 TN 727600 1 0.056 0.21 0.092\n 5 401 Winston-Sa… 0.0448 NC 493180 1 0.174 0.262 0.072\n 6 402 Martinsvil… 0.0518 VA 92753 0 0.224 0.137 0.024\n 7 500 Greensboro 0.0474 NC 1055133 1 0.218 0.22 0.068\n 8 601 North Wilk… 0.0517 NC 90016 0 0.032 0.114 0.012\n 9 602 Galax 0.0796 VA 64676 0 0.029 0.131 0.005\n10 700 Spartanburg 0.0431 SC 354533 1 0.207 0.139 0.045\n# ℹ 731 more rows\n# ℹ 34 more variables: Seg_poverty <dbl>, Seg_affluence <dbl>, Commute <dbl>,\n# Income <dbl>, Gini <dbl>, Share01 <dbl>, Gini_99 <dbl>, Middle_class <dbl>,\n# Local_tax_rate <dbl>, Local_gov_spending <dbl>, Progressivity <dbl>,\n# EITC <dbl>, School_spending <dbl>, Student_teacher_ratio <dbl>,\n# Test_scores <dbl>, HS_dropout <dbl>, Colleges <dbl>, Tuition <dbl>,\n# Graduation <dbl>, Labor_force_participation <dbl>, Manufacturing <dbl>, …\n\n\n\nNote how many observations and predictors it has.\nWe’ll use Mobility as the response"
+ "section": "07 Greedy selection",
+ "text": "07 Greedy selection\nStat 406\nDaniel J. McDonald\nLast modified – 18 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
},
{
- "objectID": "schedule/slides/02-lm-example.html#a-linear-model",
- "href": "schedule/slides/02-lm-example.html#a-linear-model",
+ "objectID": "schedule/slides/07-greedy-selection.html#recap",
+ "href": "schedule/slides/07-greedy-selection.html#recap",
"title": "UBC Stat406 2023W",
- "section": "A linear model",
- "text": "A linear model\n\\[\\mbox{Mobility}_i = \\beta_0 + \\beta_1 \\, \\mbox{State}_i + \\beta_2 \\, \\mbox{Urban}_i + \\cdots + \\epsilon_i\\]\nor equivalently\n\\[E \\left[ \\biggl. \\mbox{mobility} \\, \\biggr| \\, \\mbox{State}, \\mbox{Urban},\n \\ldots \\right] = \\beta_0 + \\beta_1 \\, \\mbox{State} +\n \\beta_2 \\, \\mbox{Urban} + \\cdots\\]"
+ "section": "Recap",
+ "text": "Recap\nModel Selection means select a family of distributions for your data.\nIdeally, we’d do this by comparing the \\(R_n\\) for one family with that for another.\nWe’d use whichever has smaller \\(R_n\\).\nBut \\(R_n\\) depends on the truth, so we estimate it with \\(\\widehat{R}\\).\nThen we use whichever has smaller \\(\\widehat{R}\\)."
},
{
- "objectID": "schedule/slides/02-lm-example.html#analysis",
- "href": "schedule/slides/02-lm-example.html#analysis",
+ "objectID": "schedule/slides/07-greedy-selection.html#example",
+ "href": "schedule/slides/07-greedy-selection.html#example",
"title": "UBC Stat406 2023W",
- "section": "Analysis",
- "text": "Analysis\n\nRandomly split into a training (say 3/4) and a test set (1/4)\nUse training set to fit a model\nFit the “full” model\n“Look” at the fit\n\n\n\nset.seed(20220914)\nmob <- mobility[complete.cases(mobility), ]\nn <- nrow(mob)\nmob <- mob |> select(-Name, -ID, -State)\nset <- sample.int(n, floor(n * .75), FALSE)\ntrain <- mob[set, ]\ntest <- mob[setdiff(1:n, set), ]\nfull <- lm(Mobility ~ ., data = train)\n\n\nWhy don’t we include Name or ID?"
+ "section": "Example",
+ "text": "Example\nThe truth:\n\ndat <- tibble(\n x1 = rnorm(100), \n x2 = rnorm(100),\n y = 3 + x1 - 5 * x2 + sin(x1 * x2 / (2 * pi)) + rnorm(100, sd = 5)\n)\n\nModel 1: y ~ x1 + x2\nModel 2: y ~ x1 + x2 + x1*x2\nModel 3: y ~ x2 + sin(x1 * x2)\n\n(What are the families for each of these?)"
},
{
- "objectID": "schedule/slides/02-lm-example.html#results",
- "href": "schedule/slides/02-lm-example.html#results",
+ "objectID": "schedule/slides/07-greedy-selection.html#fit-each-model-and-estimate-r_n",
+ "href": "schedule/slides/07-greedy-selection.html#fit-each-model-and-estimate-r_n",
"title": "UBC Stat406 2023W",
- "section": "Results",
- "text": "Results\n\nsummary(full)\n\n\nCall:\nlm(formula = Mobility ~ ., data = train)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.072092 -0.010256 -0.001452 0.009170 0.090428 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 1.849e-01 8.083e-02 2.288 0.022920 * \nPopulation 3.378e-09 2.478e-09 1.363 0.173916 \nUrban 2.853e-03 3.892e-03 0.733 0.464202 \nBlack 7.807e-02 2.859e-02 2.731 0.006735 ** \nSeg_racial -5.626e-02 1.780e-02 -3.160 0.001754 ** \nSeg_income 8.677e-01 9.355e-01 0.928 0.354453 \nSeg_poverty -7.416e-01 5.014e-01 -1.479 0.140316 \nSeg_affluence -2.224e-01 4.763e-01 -0.467 0.640874 \nCommute 6.313e-02 2.838e-02 2.225 0.026915 * \nIncome 4.207e-07 6.997e-07 0.601 0.548112 \nGini 3.592e+00 3.357e+00 1.070 0.285578 \nShare01 -3.635e-02 3.357e-02 -1.083 0.279925 \nGini_99 -3.657e+00 3.356e+00 -1.090 0.276704 \nMiddle_class 1.031e-01 4.835e-02 2.133 0.033828 * \nLocal_tax_rate 2.268e-01 2.620e-01 0.866 0.387487 \nLocal_gov_spending 1.273e-07 3.016e-06 0.042 0.966374 \nProgressivity 4.983e-03 1.324e-03 3.764 0.000205 ***\nEITC -3.324e-04 4.528e-04 -0.734 0.463549 \nSchool_spending -9.019e-04 2.272e-03 -0.397 0.691658 \nStudent_teacher_ratio -1.639e-03 1.123e-03 -1.459 0.145748 \nTest_scores 2.487e-04 3.137e-04 0.793 0.428519 \nHS_dropout -1.698e-01 9.352e-02 -1.816 0.070529 . \nColleges -2.811e-02 7.661e-02 -0.367 0.713942 \nTuition 3.459e-07 4.362e-07 0.793 0.428417 \nGraduation -1.702e-02 1.425e-02 -1.194 0.233650 \nLabor_force_participation -7.850e-02 5.405e-02 -1.452 0.147564 \nManufacturing -1.605e-01 2.816e-02 -5.700 3.1e-08 ***\nChinese_imports -5.165e-04 1.004e-03 -0.514 0.607378 \nTeenage_labor -1.019e+00 2.111e+00 -0.483 0.629639 \nMigration_in 4.490e-02 3.480e-01 0.129 0.897436 \nMigration_out -4.475e-01 4.093e-01 -1.093 0.275224 \nForeign_born 9.137e-02 5.494e-02 1.663 0.097454 . \nSocial_capital -1.114e-03 2.728e-03 -0.408 0.683245 \nReligious 4.570e-02 1.298e-02 3.520 0.000506 ***\nViolent_crime -3.393e+00 1.622e+00 -2.092 0.037373 * \nSingle_mothers -3.590e-01 9.442e-02 -3.802 0.000177 ***\nDivorced 1.707e-02 1.603e-01 0.107 0.915250 \nMarried -5.894e-02 7.246e-02 -0.813 0.416720 \nLongitude -4.239e-05 2.239e-04 -0.189 0.850001 \nLatitude 6.725e-04 5.687e-04 1.182 0.238037 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.02128 on 273 degrees of freedom\nMultiple R-squared: 0.7808, Adjusted R-squared: 0.7494 \nF-statistic: 24.93 on 39 and 273 DF, p-value: < 2.2e-16"
+ "section": "Fit each model and estimate \\(R_n\\)",
+ "text": "Fit each model and estimate \\(R_n\\)\n\nforms <- list(\"y ~ x1 + x2\", \"y ~ x1 * x2\", \"y ~ x2 + sin(x1*x2)\") |> \n map(as.formula)\nfits <- map(forms, ~ lm(.x, data = dat))\nmap(fits, ~ tibble(\n R2 = summary(.x)$r.sq,\n training_error = mean(residuals(.x)^2),\n loocv = mean( (residuals(.x) / (1 - hatvalues(.x)))^2 ),\n AIC = AIC(.x),\n BIC = BIC(.x)\n)) |> list_rbind()\n\n# A tibble: 3 × 5\n R2 training_error loocv AIC BIC\n <dbl> <dbl> <dbl> <dbl> <dbl>\n1 0.589 21.3 22.9 598. 608.\n2 0.595 21.0 23.4 598. 611.\n3 0.586 21.4 23.0 598. 609."
},
{
- "objectID": "schedule/slides/02-lm-example.html#diagnostic-plots",
- "href": "schedule/slides/02-lm-example.html#diagnostic-plots",
+ "objectID": "schedule/slides/07-greedy-selection.html#model-selection-vs.-variable-selection",
+ "href": "schedule/slides/07-greedy-selection.html#model-selection-vs.-variable-selection",
"title": "UBC Stat406 2023W",
- "section": "Diagnostic plots",
- "text": "Diagnostic plots\n\n\npar(mar = c(5, 3, 0, 0))\nplot(full, 1)\n\n\n\n\n\n\n\n\n\n\n\n\nplot(full, 2)"
+ "section": "Model Selection vs. Variable Selection",
+    "text": "Model Selection vs. Variable Selection\nModel selection is very comprehensive:\nYou choose a full statistical model (probability distribution) that is hypothesized to have generated the data.\nVariable selection is a subset of this. It means\n\nchoosing which predictors to include in a predictive model\n\nEliminating a predictor means removing it from the model.\nSome procedures automatically search over the predictors and eliminate some of them.\nWe call this variable selection. But the procedure is implicitly selecting a model as well.\n\nMaking this all the more complicated: with lots of effort, we can map procedures/algorithms to larger classes of probability models and analyze them."
},
{
- "objectID": "schedule/slides/02-lm-example.html#section",
- "href": "schedule/slides/02-lm-example.html#section",
+ "objectID": "schedule/slides/07-greedy-selection.html#selecting-variables-predictors-with-linear-methods",
+ "href": "schedule/slides/07-greedy-selection.html#selecting-variables-predictors-with-linear-methods",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "(Those were plot methods for objects of class lm)\nSame thing in ggplot\n\n\nstuff <- tibble(\n residuals = residuals(full), \n fitted = fitted(full),\n stdresiduals = rstandard(full)\n)\nggplot(stuff, aes(fitted, residuals)) +\n geom_point(colour = \"salmon\") +\n geom_smooth(\n se = FALSE, \n colour = \"steelblue\", \n linewidth = 2) +\n ggtitle(\"Residuals vs Fitted\")\n\n\n\n\n\n\n\n\n\n\n\n\nggplot(stuff, aes(sample = stdresiduals)) +\n geom_qq(colour = \"purple\", size = 2) +\n geom_qq_line(colour = \"peachpuff\", linewidth = 2) +\n labs(\n x = \"Theoretical quantiles\", \n y = \"Standardized residuals\",\n title = \"Normal Q-Q\")"
+ "section": "Selecting variables / predictors with linear methods",
+ "text": "Selecting variables / predictors with linear methods\n\n\nSuppose we have a pile of predictors.\nWe estimate models with different subsets of predictors and use CV / Cp / AIC / BIC to decide which is preferred.\nSometimes you might have a few plausible subsets. Easy enough to choose with our criterion.\nSometimes you might just have a bunch of predictors, then what do you do?\n\n\n\nAll subsets\n\nestimate model based on every possible subset of size \\(|\\mathcal{S}| \\leq \\min\\{n, p\\}\\), use one with lowest risk estimate\n\nForward selection\n\nstart with \\(\\mathcal{S}=\\varnothing\\), add predictors greedily\n\nBackward selection\n\nstart with \\(\\mathcal{S}=\\{1,\\ldots,p\\}\\), remove greedily\n\nHybrid\n\ncombine forward and backward smartly"
},
{
- "objectID": "schedule/slides/02-lm-example.html#fit-a-reduced-model",
- "href": "schedule/slides/02-lm-example.html#fit-a-reduced-model",
+ "objectID": "schedule/slides/07-greedy-selection.html#costs-and-benefits",
+ "href": "schedule/slides/07-greedy-selection.html#costs-and-benefits",
"title": "UBC Stat406 2023W",
- "section": "Fit a reduced model",
- "text": "Fit a reduced model\n\nreduced <- lm(\n Mobility ~ Commute + Gini_99 + Test_scores + HS_dropout +\n Manufacturing + Migration_in + Religious + Single_mothers, \n data = train)\n\nsummary(reduced)$coefficients |> as_tibble()\n\n# A tibble: 9 × 4\n Estimate `Std. Error` `t value` `Pr(>|t|)`\n <dbl> <dbl> <dbl> <dbl>\n1 0.166 0.0178 9.36 1.83e-18\n2 0.0637 0.0149 4.27 2.62e- 5\n3 -0.109 0.0390 -2.79 5.64e- 3\n4 0.000500 0.000256 1.95 5.19e- 2\n5 -0.216 0.0820 -2.64 8.81e- 3\n6 -0.159 0.0202 -7.89 5.65e-14\n7 -0.389 0.172 -2.26 2.42e- 2\n8 0.0436 0.0105 4.16 4.08e- 5\n9 -0.286 0.0466 -6.15 2.44e- 9\n\nreduced |> broom::glance() |> print(width = 120)\n\n# A tibble: 1 × 12\n r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC\n <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n1 0.718 0.711 0.0229 96.9 5.46e-79 8 743. -1466. -1429.\n deviance df.residual nobs\n <dbl> <int> <int>\n1 0.159 304 313"
+ "section": "Costs and benefits",
+ "text": "Costs and benefits\n\nAll subsets\n\n👍 estimates each subset\n💣 takes \\(2^p\\) model fits when \\(p<n\\). If \\(p=50\\), this is about \\(10^{15}\\) models.\n\nForward selection\n\n👍 computationally feasible\n💣 ignores some models, correlated predictors means bad performance\n\nBackward selection\n\n👍 computationally feasible\n💣 ignores some models, correlated predictors means bad performance\n💣 doesn’t work if \\(p>n\\)\n\nHybrid\n\n👍 visits more models than forward/backward\n💣 slower"
},
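+The 2^p count behind the “all subsets” cost above is easy to verify directly:
+2^50 # about 1.13e15 candidate models when p = 50
+sum(choose(50, 0:50)) # the same count, summed over the possible subset sizes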
{
- "objectID": "schedule/slides/02-lm-example.html#diagnostic-plots-for-reduced-model",
- "href": "schedule/slides/02-lm-example.html#diagnostic-plots-for-reduced-model",
+ "objectID": "schedule/slides/07-greedy-selection.html#synthetic-example",
+ "href": "schedule/slides/07-greedy-selection.html#synthetic-example",
"title": "UBC Stat406 2023W",
- "section": "Diagnostic plots for reduced model",
- "text": "Diagnostic plots for reduced model\n\n\nplot(reduced, 1)\n\n\n\n\n\n\n\n\n\n\n\n\nplot(reduced, 2)"
+ "section": "Synthetic example",
+ "text": "Synthetic example\n\nset.seed(123)\nn <- 406\ndf <- tibble( # like data.frame, but columns can be functions of preceding\n x1 = rnorm(n),\n x2 = rnorm(n, mean = 2, sd = 1),\n x3 = rexp(n, rate = 1),\n x4 = x2 + rnorm(n, sd = .1), # correlated with x2\n x5 = x1 + rnorm(n, sd = .1), # correlated with x1\n x6 = x1 - x2 + rnorm(n, sd = .1), # correlated with x2 and x1 (and others)\n x7 = x1 + x3 + rnorm(n, sd = .1), # correlated with x1 and x3 (and others)\n y = x1 * 3 + x2 / 3 + rnorm(n, sd = 2.2) # function of x1 and x2 only\n)\n\n\n\\(\\mathbf{x}_1\\) and \\(\\mathbf{x}_2\\) are the true predictors\nBut the rest are correlated with them"
},
{
- "objectID": "schedule/slides/02-lm-example.html#how-do-we-decide-which-model-is-better",
- "href": "schedule/slides/02-lm-example.html#how-do-we-decide-which-model-is-better",
+ "objectID": "schedule/slides/07-greedy-selection.html#full-model",
+ "href": "schedule/slides/07-greedy-selection.html#full-model",
"title": "UBC Stat406 2023W",
- "section": "How do we decide which model is better?",
- "text": "How do we decide which model is better?\n\n\n\nGoodness of fit versus prediction power\n\n\nmap( # smaller AIC is better\n list(full = full, reduced = reduced), \n ~ c(aic = AIC(.x), rsq = summary(.x)$r.sq))\n\n$full\n aic rsq \n-1482.5981023 0.7807509 \n\n$reduced\n aic rsq \n-1466.088492 0.718245 \n\n\n\nUse both models to predict Mobility\nCompare both sets of predictions\n\n\n\n\nmses <- function(preds, obs) {\n round(mean((obs - preds)^2), 5)\n}\nc(\n full = mses(\n predict(full, newdata = test), \n test$Mobility),\n reduced = mses(\n predict(reduced, newdata = test), \n test$Mobility)\n)\n\n full reduced \n0.00072 0.00084 \n\n\n\n\nCode\ntest$full <- predict(full, newdata = test)\ntest$reduced <- predict(reduced, newdata = test)\ntest |> \n select(Mobility, full, reduced) |>\n pivot_longer(-Mobility) |>\n ggplot(aes(Mobility, value)) + \n geom_point(color = \"orange\") + \n facet_wrap(~name, 2) +\n xlab('observed mobility') + \n ylab('predicted mobility') +\n geom_abline(slope = 1, intercept = 0, col = \"darkblue\")"
+ "section": "Full model",
+ "text": "Full model\n\nfull <- lm(y ~ ., data = df)\nsummary(full)\n\n\nCall:\nlm(formula = y ~ ., data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.7739 -1.4283 -0.0929 1.4257 7.5869 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.03383 0.27700 0.122 0.90287 \nx1 6.70481 2.06743 3.243 0.00128 **\nx2 -0.43945 1.71650 -0.256 0.79807 \nx3 1.37293 1.11524 1.231 0.21903 \nx4 -1.19911 1.17850 -1.017 0.30954 \nx5 -0.53918 1.07089 -0.503 0.61490 \nx6 -1.88547 1.21652 -1.550 0.12196 \nx7 -1.25245 1.10743 -1.131 0.25876 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.231 on 398 degrees of freedom\nMultiple R-squared: 0.6411, Adjusted R-squared: 0.6347 \nF-statistic: 101.5 on 7 and 398 DF, p-value: < 2.2e-16"
},
{
- "objectID": "schedule/slides/00-version-control.html#meta-lecture",
- "href": "schedule/slides/00-version-control.html#meta-lecture",
+ "objectID": "schedule/slides/07-greedy-selection.html#true-model",
+ "href": "schedule/slides/07-greedy-selection.html#true-model",
"title": "UBC Stat406 2023W",
- "section": "00 Git, Github, and Slack",
- "text": "00 Git, Github, and Slack\nStat 406\nDaniel J. McDonald\nLast modified – 11 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
+ "section": "True model",
+ "text": "True model\n\ntruth <- lm(y ~ x1 + x2, data = df)\nsummary(truth)\n\n\nCall:\nlm(formula = y ~ x1 + x2, data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.4519 -1.3873 -0.1941 1.3498 7.5533 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.1676 0.2492 0.673 0.5015 \nx1 3.0316 0.1146 26.447 <2e-16 ***\nx2 0.2447 0.1109 2.207 0.0279 * \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.233 on 403 degrees of freedom\nMultiple R-squared: 0.6357, Adjusted R-squared: 0.6339 \nF-statistic: 351.6 on 2 and 403 DF, p-value: < 2.2e-16"
},
{
- "objectID": "schedule/slides/00-version-control.html#course-communication",
- "href": "schedule/slides/00-version-control.html#course-communication",
+ "objectID": "schedule/slides/07-greedy-selection.html#all-subsets",
+ "href": "schedule/slides/07-greedy-selection.html#all-subsets",
"title": "UBC Stat406 2023W",
- "section": "Course communication",
- "text": "Course communication\nWebsite:\nhttps://ubc-stat.github.io/stat-406/\n\nHosted on Github.\nLinks to slides and all materials\nSyllabus is there. Be sure to read it."
+ "section": "All subsets",
+ "text": "All subsets\n\nlibrary(leaps)\ntrythemall <- regsubsets(y ~ ., data = df)\nsummary(trythemall)\n\nSubset selection object\nCall: regsubsets.formula(y ~ ., data = df)\n7 Variables (and intercept)\n Forced in Forced out\nx1 FALSE FALSE\nx2 FALSE FALSE\nx3 FALSE FALSE\nx4 FALSE FALSE\nx5 FALSE FALSE\nx6 FALSE FALSE\nx7 FALSE FALSE\n1 subsets of each size up to 7\nSelection Algorithm: exhaustive\n x1 x2 x3 x4 x5 x6 x7 \n1 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \" \" \" \"\n2 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \"*\" \" \"\n3 ( 1 ) \"*\" \" \" \" \" \"*\" \" \" \"*\" \" \"\n4 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \" \"\n5 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \"*\"\n6 ( 1 ) \"*\" \" \" \"*\" \"*\" \"*\" \"*\" \"*\"\n7 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \"*\" \"*\""
},
{
- "objectID": "schedule/slides/00-version-control.html#course-communication-1",
- "href": "schedule/slides/00-version-control.html#course-communication-1",
+ "objectID": "schedule/slides/07-greedy-selection.html#bic-and-cp",
+ "href": "schedule/slides/07-greedy-selection.html#bic-and-cp",
"title": "UBC Stat406 2023W",
- "section": "Course communication",
- "text": "Course communication\nSlack:\n\nLink to join on Canvas. This is our discussion board.\nNote that this data is hosted on servers outside of Canada. You may wish to use a pseudonym to protect your privacy.\nAnything super important will be posted to Slack and Canvas.\nBe sure you get Canvas email.\nIf I am sick, I will cancel class or arrange a substitute."
+ "section": "BIC and Cp",
+ "text": "BIC and Cp\n\n\ntibble(\n BIC = summary(trythemall)$bic, \n Cp = summary(trythemall)$cp,\n size = 1:7\n) |>\n pivot_longer(-size) |>\n ggplot(aes(size, value, colour = name)) + \n geom_point() + \n geom_line() + \n facet_wrap(~name, scales = \"free_y\") + \n ylab(\"\") +\n scale_colour_manual(\n values = c(blue, orange), \n guide = \"none\"\n )"
},
{
- "objectID": "schedule/slides/00-version-control.html#course-communication-2",
- "href": "schedule/slides/00-version-control.html#course-communication-2",
+ "objectID": "schedule/slides/07-greedy-selection.html#forward-stepwise",
+ "href": "schedule/slides/07-greedy-selection.html#forward-stepwise",
"title": "UBC Stat406 2023W",
- "section": "Course communication",
- "text": "Course communication\nGitHub organization\n\nLinked from the website.\nThis is where you complete / submit assignments / projects / in-class-work\nThis is also hosted on Servers outside Canada https://github.com/stat-406-2023/"
+ "section": "Forward stepwise",
+ "text": "Forward stepwise\n\nstepup <- regsubsets(y ~ ., data = df, method = \"forward\")\nsummary(stepup)\n\nSubset selection object\nCall: regsubsets.formula(y ~ ., data = df, method = \"forward\")\n7 Variables (and intercept)\n Forced in Forced out\nx1 FALSE FALSE\nx2 FALSE FALSE\nx3 FALSE FALSE\nx4 FALSE FALSE\nx5 FALSE FALSE\nx6 FALSE FALSE\nx7 FALSE FALSE\n1 subsets of each size up to 7\nSelection Algorithm: forward\n x1 x2 x3 x4 x5 x6 x7 \n1 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \" \" \" \"\n2 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \"*\" \" \"\n3 ( 1 ) \"*\" \" \" \" \" \"*\" \" \" \"*\" \" \"\n4 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \" \"\n5 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \"*\"\n6 ( 1 ) \"*\" \" \" \"*\" \"*\" \"*\" \"*\" \"*\"\n7 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \"*\" \"*\""
},
{
- "objectID": "schedule/slides/00-version-control.html#why-these",
- "href": "schedule/slides/00-version-control.html#why-these",
+ "objectID": "schedule/slides/07-greedy-selection.html#bic-and-cp-1",
+ "href": "schedule/slides/07-greedy-selection.html#bic-and-cp-1",
"title": "UBC Stat406 2023W",
- "section": "Why these?",
- "text": "Why these?\n\nYes, some data is hosted on servers in the US.\nBut in the real world, no one uses Canvas / Piazza, so why not learn things they do use?\nMuch easier to communicate, “mark” or comment on your work\nMuch more DS friendly\nNote that MDS uses both of these, the Stat and CS departments use both, many faculty use them, Google / Amazon / Facebook use things like these, etc.\n\n\nSlack help from MDS features and rules"
+ "section": "BIC and Cp",
+ "text": "BIC and Cp\n\n\ntibble(\n BIC = summary(stepup)$bic,\n Cp = summary(stepup)$cp,\n size = 1:7\n) |>\n pivot_longer(-size) |>\n ggplot(aes(size, value, colour = name)) +\n geom_point() +\n geom_line() +\n facet_wrap(~name, scales = \"free_y\") +\n ylab(\"\") +\n scale_colour_manual(\n values = c(blue, orange),\n guide = \"none\"\n )"
},
{
- "objectID": "schedule/slides/00-version-control.html#why-version-control",
- "href": "schedule/slides/00-version-control.html#why-version-control",
+ "objectID": "schedule/slides/07-greedy-selection.html#backward-selection",
+ "href": "schedule/slides/07-greedy-selection.html#backward-selection",
"title": "UBC Stat406 2023W",
- "section": "Why version control?",
- "text": "Why version control?\n\nMuch of this lecture is based on material from Colin Rundel and Karl Broman"
+ "section": "Backward selection",
+ "text": "Backward selection\n\nstepdown <- regsubsets(y ~ ., data = df, method = \"backward\")\nsummary(stepdown)\n\nSubset selection object\nCall: regsubsets.formula(y ~ ., data = df, method = \"backward\")\n7 Variables (and intercept)\n Forced in Forced out\nx1 FALSE FALSE\nx2 FALSE FALSE\nx3 FALSE FALSE\nx4 FALSE FALSE\nx5 FALSE FALSE\nx6 FALSE FALSE\nx7 FALSE FALSE\n1 subsets of each size up to 7\nSelection Algorithm: backward\n x1 x2 x3 x4 x5 x6 x7 \n1 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \" \" \" \"\n2 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \"*\" \" \"\n3 ( 1 ) \"*\" \" \" \" \" \"*\" \" \" \"*\" \" \"\n4 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \" \"\n5 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \"*\"\n6 ( 1 ) \"*\" \" \" \"*\" \"*\" \"*\" \"*\" \"*\"\n7 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \"*\" \"*\""
},
{
- "objectID": "schedule/slides/00-version-control.html#why-version-control-1",
- "href": "schedule/slides/00-version-control.html#why-version-control-1",
+ "objectID": "schedule/slides/07-greedy-selection.html#bic-and-cp-2",
+ "href": "schedule/slides/07-greedy-selection.html#bic-and-cp-2",
"title": "UBC Stat406 2023W",
- "section": "Why version control?",
- "text": "Why version control?\n\nSimple formal system for tracking all changes to a project\nTime machine for your projects\n\nTrack blame and/or praise\nRemove the fear of breaking things\n\nLearning curve is steep, but when you need it you REALLY need it\n\n\n\n\nWords of wisdom\n\n\nYour closest collaborator is you six months ago, but you don’t reply to emails.\n– Paul Wilson"
+ "section": "BIC and Cp",
+ "text": "BIC and Cp\n\n\ntibble(\n BIC = summary(stepdown)$bic,\n Cp = summary(stepdown)$cp,\n size = 1:7\n) |>\n pivot_longer(-size) |>\n ggplot(aes(size, value, colour = name)) +\n geom_point() +\n geom_line() +\n facet_wrap(~name, scales = \"free_y\") +\n ylab(\"\") +\n scale_colour_manual(\n values = c(blue, orange), \n guide = \"none\"\n )"
},
{
- "objectID": "schedule/slides/00-version-control.html#why-git",
- "href": "schedule/slides/00-version-control.html#why-git",
+ "objectID": "schedule/slides/07-greedy-selection.html#section",
+ "href": "schedule/slides/07-greedy-selection.html#section",
"title": "UBC Stat406 2023W",
- "section": "Why Git",
- "text": "Why Git\n\n\n\nYou could use something like Box or Dropbox\nThese are poor-man’s version control\nGit is much more appropriate\nIt works with large groups\nIt’s very fast\nIt’s much better at fixing mistakes\nTech companies use it (so it’s in your interest to have some experience)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThis will hurt, but what doesn’t kill you, makes you stronger."
+ "section": "",
+ "text": "somehow, for this seed, everything is the same"
},
{
- "objectID": "schedule/slides/00-version-control.html#overview",
- "href": "schedule/slides/00-version-control.html#overview",
+ "objectID": "schedule/slides/07-greedy-selection.html#randomness-and-prediction-error",
+ "href": "schedule/slides/07-greedy-selection.html#randomness-and-prediction-error",
"title": "UBC Stat406 2023W",
- "section": "Overview",
- "text": "Overview\n\ngit is a command line program that lives on your machine\nIf you want to track changes in a directory, you type git init\nThis creates a (hidden) directory called .git\nThe .git directory contains a history of all changes made to “versioned” files\nThis top directory is referred to as a “repository” or “repo”\nhttp://github.com is a service that hosts a repo remotely and has other features: issues, project boards, pull requests, renders .ipynb & .md\nSome IDEs (pycharm, RStudio, VScode) have built in git\ngit/GitHub is broad and complicated. Here, just what you need"
+ "section": "Randomness and prediction error",
+ "text": "Randomness and prediction error\nAll of that was for one data set.\nDoesn’t say which procedure is better generally.\nIf we want to know how they compare generally, we should repeat many times\n\nGenerate training data\nEstimate with different algorithms\nPredict held-out set data\nExamine prediction MSE (on held-out set)\n\n\nI’m not going to do all subsets, just the truth, forward selection, backward, and the full model\nFor forward/backward selection, I’ll use Cp to choose the final size"
},
{
- "objectID": "schedule/slides/00-version-control.html#aside-on-built-in-command-line",
- "href": "schedule/slides/00-version-control.html#aside-on-built-in-command-line",
+ "objectID": "schedule/slides/07-greedy-selection.html#code-for-simulation",
+ "href": "schedule/slides/07-greedy-selection.html#code-for-simulation",
"title": "UBC Stat406 2023W",
- "section": "Aside on “Built-in” & “Command line”",
- "text": "Aside on “Built-in” & “Command line”\n\n\n\n\n\n\nTip\n\n\nFirst things first, RStudio and the Terminal\n\n\n\n\nCommand line is the “old” type of computing. You type commands at a prompt and the computer “does stuff”.\nYou may not have seen where this is. RStudio has one built in called “Terminal”\nThe Mac System version is also called “Terminal”. If you have a Linux machine, this should all be familiar.\nWindows is not great at this.\nTo get the most out of Git, you have to use the command line."
+ "section": "Code for simulation",
+ "text": "Code for simulation\n… Annoyingly, no predict method for regsubsets, so we make one.\n\npredict.regsubsets <- function(object, newdata, risk_estimate = c(\"cp\", \"bic\"), ...) {\n risk_estimate <- match.arg(risk_estimate)\n chosen <- coef(object, which.min(summary(object)[[risk_estimate]]))\n predictors <- names(chosen)\n if (object$intercept) predictors <- predictors[-1]\n X <- newdata[, predictors]\n if (object$intercept) X <- cbind2(1, X)\n drop(as.matrix(X) %*% chosen)\n}"
},
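+For example, using the stepup fit and the df data from the earlier slides, the method above is called through the usual generic:
+step_preds <- predict(stepup, newdata = df, risk_estimate = "cp") # model size chosen by Cp
+mean((df$y - step_preds)^2) # in-sample MSE for the Cp-chosen forward-selection model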
{
- "objectID": "schedule/slides/00-version-control.html#typical-workflow",
- "href": "schedule/slides/00-version-control.html#typical-workflow",
+ "objectID": "schedule/slides/07-greedy-selection.html#section-1",
+ "href": "schedule/slides/07-greedy-selection.html#section-1",
"title": "UBC Stat406 2023W",
- "section": "Typical workflow",
- "text": "Typical workflow\n\nDownload a repo from Github\n\ngit clone https://github.com/stat550-2021/lecture-slides.git\n\nCreate a branch\n\ngit branch <branchname>\n\nMake changes to your files.\nAdd your changes to be tracked (“stage” them)\n\ngit add <name/of/tracked/file>\n\nCommit your changes\n\ngit commit -m \"Some explanatory message\"\nRepeat 3–5 as needed. Once you’re satisfied\n\nPush to GitHub\n\ngit push\ngit push -u origin <branchname>"
+ "section": "",
+ "text": "simulate_and_estimate_them_all <- function(n = 406) {\n N <- 2 * n # generate 2x the amount of data (half train, half test)\n df <- tibble( # generate data\n x1 = rnorm(N), \n x2 = rnorm(N, mean = 2), \n x3 = rexp(N),\n x4 = x2 + rnorm(N, sd = .1), \n x5 = x1 + rnorm(N, sd = .1),\n x6 = x1 - x2 + rnorm(N, sd = .1), \n x7 = x1 + x3 + rnorm(N, sd = .1),\n y = x1 * 3 + x2 / 3 + rnorm(N, sd = 2.2)\n )\n train <- df[1:n, ] # half the data for training\n test <- df[(n + 1):N, ] # half the data for evaluation\n \n oracle <- lm(y ~ x1 + x2 - 1, data = train) # knowing the right model, not the coefs\n full <- lm(y ~ ., data = train)\n stepup <- regsubsets(y ~ ., data = train, method = \"forward\")\n stepdown <- regsubsets(y ~ ., data = train, method = \"backward\")\n \n tibble(\n y = test$y,\n oracle = predict(oracle, newdata = test),\n full = predict(full, newdata = test),\n stepup = predict(stepup, newdata = test),\n stepdown = predict(stepdown, newdata = test),\n truth = drop(as.matrix(test[, c(\"x1\", \"x2\")]) %*% c(3, 1/3))\n )\n}\n\nset.seed(12345)\nour_sim <- map(1:50, ~ simulate_and_estimate_them_all(406)) |>\n list_rbind(names_to = \"sim\")"
},
{
- "objectID": "schedule/slides/00-version-control.html#what-should-be-tracked",
- "href": "schedule/slides/00-version-control.html#what-should-be-tracked",
+ "objectID": "schedule/slides/07-greedy-selection.html#what-is-oracle",
+ "href": "schedule/slides/07-greedy-selection.html#what-is-oracle",
"title": "UBC Stat406 2023W",
- "section": "What should be tracked?",
- "text": "What should be tracked?\n\n\nDefinitely\n\ncode, markdown documentation, tex files, bash scripts/makefiles, …\n\n\n\n\nPossibly\n\nlogs, jupyter notebooks, images (that won’t change), …\n\n\n\n\nQuestionable\n\nprocessed data, static pdfs, …\n\n\n\n\nDefinitely not\n\nfull data, continually updated pdfs, other things compiled from source code, …"
+ "section": "What is “Oracle”",
+ "text": "What is “Oracle”"
},
{
- "objectID": "schedule/slides/00-version-control.html#what-things-to-track",
- "href": "schedule/slides/00-version-control.html#what-things-to-track",
+ "objectID": "schedule/slides/07-greedy-selection.html#results",
+ "href": "schedule/slides/07-greedy-selection.html#results",
"title": "UBC Stat406 2023W",
- "section": "What things to track",
- "text": "What things to track\n\nYou decide what is “versioned”.\nA file called .gitignore tells git files or types to never track\n\n# History files\n.Rhistory\n.Rapp.history\n\n# Session Data files\n.RData\n\n# User-specific files\n.Ruserdata\n\n# Compiled junk\n*.o\n*.so\n*.DS_Store\n\nShortcut to track everything (use carefully):\n\ngit add ."
+ "section": "Results",
+ "text": "Results\n\n\nour_sim |> \n group_by(sim) %>%\n summarise(\n across(oracle:truth, ~ mean((y - .)^2)), \n .groups = \"drop\"\n ) %>%\n transmute(across(oracle:stepdown, ~ . / truth - 1)) |> \n pivot_longer(\n everything(), \n names_to = \"method\", \n values_to = \"mse\"\n ) |> \n ggplot(aes(method, mse, fill = method)) +\n geom_boxplot(notch = TRUE) +\n geom_hline(yintercept = 0, linewidth = 2) +\n scale_fill_viridis_d() +\n theme(legend.position = \"none\") +\n scale_y_continuous(\n labels = scales::label_percent()\n ) +\n ylab(\"% increase in mse relative\\n to the truth\")"
},
{
- "objectID": "schedule/slides/00-version-control.html#rules",
- "href": "schedule/slides/00-version-control.html#rules",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#meta-lecture",
+ "href": "schedule/slides/05-estimating-test-mse.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Rules",
- "text": "Rules\nHomework and Labs\n\nYou each have your own repo\nYou make a branch\nDO NOT rename files\nMake enough commits (3 for labs, 5 for HW).\nPush your changes (at anytime) and make a PR against main when done.\nTAs review your work.\nOn HW, if you want to revise, make changes in response to feedback and push to the same branch. Then “re-request review”."
+ "section": "05 Estimating test MSE",
+ "text": "05 Estimating test MSE\nStat 406\nDaniel J. McDonald\nLast modified – 18 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
},
{
- "objectID": "schedule/slides/00-version-control.html#whats-a-pr",
- "href": "schedule/slides/00-version-control.html#whats-a-pr",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#estimating-prediction-risk",
+ "href": "schedule/slides/05-estimating-test-mse.html#estimating-prediction-risk",
"title": "UBC Stat406 2023W",
- "section": "What’s a PR?",
- "text": "What’s a PR?\n\nThis exists on GitHub (not git)\nDemonstration"
+ "section": "Estimating prediction risk",
+ "text": "Estimating prediction risk\nLast time, we saw\n\\(R_n(\\widehat{f}) = E[(Y-\\widehat{f}(X))^2]\\)\nprediction risk = \\(\\textrm{bias}^2\\) + variance + irreducible error\nWe argued that we want procedures that produce \\(\\widehat{f}\\) with small \\(R_n\\).\n\nHow do we estimate \\(R_n\\)?"
},
{
- "objectID": "schedule/slides/00-version-control.html#whats-a-pr-1",
- "href": "schedule/slides/00-version-control.html#whats-a-pr-1",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#dont-use-training-error",
+ "href": "schedule/slides/05-estimating-test-mse.html#dont-use-training-error",
"title": "UBC Stat406 2023W",
- "section": "What’s a PR?",
- "text": "What’s a PR?\n\nThis exists on GitHub (not git)\nDemonstration"
+ "section": "Don’t use training error",
+ "text": "Don’t use training error\nThe training error in regression is\n\\[\\widehat{R}_n(\\widehat{f}) = \\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{f}(x_i))^2\\]\nHere, the \\(n\\) is doubly used (annoying, but simple): \\(n\\) observations to create \\(\\widehat{f}\\) and \\(n\\) terms in the sum.\n\n\n\n\n\n\nImportant\n\n\nTraining error is a bad estimator for \\(R_n(\\widehat{f})\\).\n\n\n\nSo we should never use it."
},
{
- "objectID": "schedule/slides/00-version-control.html#some-things-to-be-aware-of",
- "href": "schedule/slides/00-version-control.html#some-things-to-be-aware-of",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#these-all-have-the-same-r2-and-training-error",
+ "href": "schedule/slides/05-estimating-test-mse.html#these-all-have-the-same-r2-and-training-error",
"title": "UBC Stat406 2023W",
- "section": "Some things to be aware of",
- "text": "Some things to be aware of\n\nmaster vs main\nIf you think you did something wrong, stop and ask for help\nThere are guardrails in place. But those won’t stop a bulldozer.\nThe hardest part is the initial setup. Then, this should all be rinse-and-repeat.\nThis book is great: Happy Git with R\n\nSee Chapter 6 if you have install problems.\nSee Chapter 9 for credential caching (avoid typing a password all the time)\nSee Chapter 13 if RStudio can’t find git"
+ "section": "These all have the same \\(R^2\\) and Training Error",
+ "text": "These all have the same \\(R^2\\) and Training Error\n\n\n\n\nCode\nans <- anscombe |>\n pivot_longer(everything(), names_to = c(\".value\", \"set\"), \n names_pattern = \"(.)(.)\")\nggplot(ans, aes(x, y)) + \n geom_point(colour = orange, size = 3) + \n geom_smooth(method = \"lm\", se = FALSE, color = blue, linewidth = 2) +\n facet_wrap(~set, labeller = label_both)\n\n\n\n\n\n\n\n\n\n\n\n\nans %>% \n group_by(set) |> \n summarise(\n R2 = summary(lm(y ~ x))$r.sq, \n train_error = mean((y - predict(lm(y ~ x)))^2)\n ) |>\n kableExtra::kable(digits = 2)\n\n\n\n\nset\nR2\ntrain_error\n\n\n\n\n1\n0.67\n1.25\n\n\n2\n0.67\n1.25\n\n\n3\n0.67\n1.25\n\n\n4\n0.67\n1.25"
},
{
- "objectID": "schedule/slides/00-version-control.html#the-maindevelopbranch-workflow",
- "href": "schedule/slides/00-version-control.html#the-maindevelopbranch-workflow",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#adding-junk-predictors-increases-r2-and-decreases-training-error",
+ "href": "schedule/slides/05-estimating-test-mse.html#adding-junk-predictors-increases-r2-and-decreases-training-error",
"title": "UBC Stat406 2023W",
- "section": "The main/develop/branch workflow",
- "text": "The main/develop/branch workflow\n\nWhen working on your own\n\nDon’t NEED branches (but you should use them, really)\nI make a branch if I want to try a modification without breaking what I have.\n\nWhen working on a large team with production grade software\n\nmain is protected, released version of software (maybe renamed to release)\ndevelop contains things not yet on main, but thoroughly tested\nOn a schedule (once a week, once a month) develop gets merged to main\nYou work on a feature branch off develop to build your new feature\nYou do a PR against develop. Supervisors review your contributions\n\n\n\nI and many DS/CS/Stat faculty use this workflow with my lab."
+ "section": "Adding “junk” predictors increases \\(R^2\\) and decreases Training Error",
+ "text": "Adding “junk” predictors increases \\(R^2\\) and decreases Training Error\n\nn <- 100\np <- 10\nq <- 0:30\nx <- matrix(rnorm(n * (p + max(q))), nrow = n)\ny <- x[, 1:p] %*% c(5:1, 1:5) + rnorm(n, 0, 10)\n\nregress_on_junk <- function(q) {\n x <- x[, 1:(p + q)]\n mod <- lm(y ~ x)\n tibble(R2 = summary(mod)$r.sq, train_error = mean((y - predict(mod))^2))\n}\n\n\n\nCode\nmap(q, regress_on_junk) |> \n list_rbind() |>\n mutate(q = q) |>\n pivot_longer(-q) |>\n ggplot(aes(q, value, colour = name)) +\n geom_line(linewidth = 2) + xlab(\"train_error\") +\n scale_colour_manual(values = c(blue, orange), guide = \"none\") +\n facet_wrap(~ name, scales = \"free_y\")"
},
{
- "objectID": "schedule/slides/00-version-control.html#protection",
- "href": "schedule/slides/00-version-control.html#protection",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#other-things-you-cant-use",
+ "href": "schedule/slides/05-estimating-test-mse.html#other-things-you-cant-use",
"title": "UBC Stat406 2023W",
- "section": "Protection",
- "text": "Protection\n\nTypical for your PR to trigger tests to make sure you don’t break things\nTypical for team members or supervisors to review your PR for compliance"
+ "section": "Other things you can’t use",
+ "text": "Other things you can’t use\nYou should not use anova\nor the \\(p\\)-values from the lm output for this purpose.\n\nThese things are to determine whether those parameters are different from zero if you were to repeat the experiment many times, if the model were true, etc. etc.\n\nNot the same as “are they useful for prediction = do they help me get smaller \\(R_n\\)?”"
},
{
- "objectID": "schedule/slides/00-version-control.html#guardrails",
- "href": "schedule/slides/00-version-control.html#guardrails",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#risk-of-risk",
+ "href": "schedule/slides/05-estimating-test-mse.html#risk-of-risk",
"title": "UBC Stat406 2023W",
- "section": "Guardrails",
- "text": "Guardrails\n\nThe .github directory contains interactions with GitHub\n\nActions: On push / PR / other GitHub does something on their server (builds a website, runs tests on code)\nPR templates: Little admonitions when you open a PR\nBranch protection: prevent you from doing stuff\n\nIn this course, I protect main so that you can’t push there\n\n\n\n\n\n\n\nWarning\n\n\nIf you try to push to main, it will give an error like\nremote: error: GH006: Protected branch update failed for refs/heads/main.\nThe fix is: make a new branch, then push that."
+ "section": "Risk of Risk",
+ "text": "Risk of Risk\nWhile it’s crummy, Training Error is an estimator of \\(R_n(\\hat{f})\\)\nRecall, \\(R_n(\\hat{f})\\) is a parameter (a property of the data distribution)\nSo we can ask “is \\(\\widehat{R}(\\hat{f})\\) a good estimator for \\(R_n(\\hat{f})\\)?”\nBoth are just numbers, so perhaps a good way to measure is\n\\[\nE[(R_n - \\widehat{R})^2]\n= \\cdots\n= (R_n - E[\\widehat{R}])^2 + \\Var{\\widehat{R}}\n\\]\nChoices you make determine how good this is.\nWe can try to balance it’s bias and variance…"
},
{
- "objectID": "schedule/slides/00-version-control.html#operations-in-rstudio",
- "href": "schedule/slides/00-version-control.html#operations-in-rstudio",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#held-out-sets",
+ "href": "schedule/slides/05-estimating-test-mse.html#held-out-sets",
"title": "UBC Stat406 2023W",
- "section": "Operations in Rstudio",
- "text": "Operations in Rstudio\n\n\n\nStage\nCommit\nPush\nPull\nCreate a branch\n\nCovers:\n\nEverything to do your HW / Project if you’re careful\nPlus most other things you “want to do”\n\n\n\nCommand line versions (of the same)\ngit add <name/of/file>\n\ngit commit -m \"some useful message\"\n\ngit push\n\ngit pull\n\ngit checkout -b <name/of/branch>"
+ "section": "Held out sets",
+ "text": "Held out sets\nOne option is to have a separate “held out” or “validation set”.\n👍 Estimates the test error\n👍 Fast computationally\n🤮 Estimate is random\n🤮 Estimate has high variance (depends on 1 choice of split)\n🤮 Estimate has some bias because we only used some of the data"
},
{
- "objectID": "schedule/slides/00-version-control.html#other-useful-stuff-but-command-line-only",
- "href": "schedule/slides/00-version-control.html#other-useful-stuff-but-command-line-only",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#aside",
+ "href": "schedule/slides/05-estimating-test-mse.html#aside",
"title": "UBC Stat406 2023W",
- "section": "Other useful stuff (but command line only)",
- "text": "Other useful stuff (but command line only)\n\n\nInitializing\ngit config user.name --global \"Daniel J. McDonald\"\ngit config user.email --global \"daniel@stat.ubc.ca\"\ngit config core.editor --global nano \n# or emacs or ... (default is vim)\nStaging\ngit add name/of/file # stage 1 file\ngit add . # stage all\nCommitting\n# stage/commit simultaneously\ngit commit -am \"message\" \n\n# open editor to write long commit message\ngit commit \nPushing\n# If branchname doesn't exist\n# on remote, create it and push\ngit push -u origin branchname\n\n\nBranching\n# switch to branchname, error if uncommitted changes\ngit checkout branchname \n# switch to a previous commit\ngit checkout aec356\n\n# create a new branch\ngit branch newbranchname\n# create a new branch and check it out\ngit checkout -b newbranchname\n\n# merge changes in branch2 onto branch1\ngit checkout branch1\ngit merge branch2\n\n# grab a file from branch2 and put it on current\ngit checkout branch2 -- name/of/file\n\ngit branch -v # list all branches\nCheck the status\ngit status\ngit remote -v # list remotes\ngit log # show recent commits, msgs"
+ "section": "Aside",
+ "text": "Aside\nIn my experience, CS has particular definitions of “training”, “validation”, and “test” data.\nI think these are not quite the same as in Statistics.\n\nTest data - Hypothetical data you don’t get to see, ever. Infinite amounts drawn from the population.\n\nExpected test error or Risk is an expected value over this distribution. It’s not a sum over some data kept aside.\n\nSometimes I’ll give you “test data”. You pretend that this is a good representation of the expectation and use it to see how well you did on the training data.\nTraining data - This is data that you get to touch.\nValidation set - Often, we need to choose models. One way to do this is to split off some of your training data and pretend that it’s like a “Test Set”.\n\nWhen and how you split your training data can be very important."
},
{
- "objectID": "schedule/slides/00-version-control.html#conflicts",
- "href": "schedule/slides/00-version-control.html#conflicts",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#intuition-for-cv",
+ "href": "schedule/slides/05-estimating-test-mse.html#intuition-for-cv",
"title": "UBC Stat406 2023W",
- "section": "Conflicts",
- "text": "Conflicts\n\nSometimes you merge things and “conflicts” happen.\nMeaning that changes on one branch would overwrite changes on a different branch.\n\n\n\n\nThey look like this:\n\nHere are lines that are either unchanged from\nthe common ancestor, or cleanly resolved \nbecause only one side changed.\n\nBut below we have some troubles\n<<<<<<< yours:sample.txt\nConflict resolution is hard;\nlet's go shopping.\n=======\nGit makes conflict resolution easy.\n>>>>>>> theirs:sample.txt\n\nAnd here is another line that is cleanly \nresolved or unmodified.\n\n\nYou get to decide, do you want to keep\n\nYour changes (above ======)\nTheir changes (below ======)\nBoth.\nNeither.\n\nBut always delete the <<<<<, ======, and >>>>> lines.\nOnce you’re satisfied, committing resolves the conflict."
+ "section": "Intuition for CV",
+ "text": "Intuition for CV\nOne reason that \\(\\widehat{R}_n(\\widehat{f})\\) is bad is that we are using the same data to pick \\(\\widehat{f}\\) AND to estimate \\(R_n\\).\n“Validation set” fixes this, but holds out a particular, fixed block of data we pretend mimics the “test data”\n\nWhat if we set aside one observation, say the first one \\((y_1, x_1)\\).\nWe estimate \\(\\widehat{f}^{(1)}\\) without using the first observation.\nThen we test our prediction:\n\\[\\widetilde{R}_1(\\widehat{f}^{(1)}) = (y_1 -\\widehat{f}^{(1)}(x_1))^2.\\]\n(why the notation \\(\\widetilde{R}_1\\)? Because we’re estimating the risk with 1 observation. )"
},
{
- "objectID": "schedule/slides/00-version-control.html#some-other-pointers",
- "href": "schedule/slides/00-version-control.html#some-other-pointers",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#keep-going",
+ "href": "schedule/slides/05-estimating-test-mse.html#keep-going",
"title": "UBC Stat406 2023W",
- "section": "Some other pointers",
- "text": "Some other pointers\n\nCommits have long names: 32b252c854c45d2f8dfda1076078eae8d5d7c81f\n\nIf you want to use it, you need “enough to be unique”: 32b25\n\nOnline help uses directed graphs in ways different from statistics:\n\nIn stats, arrows point from cause to effect, forward in time\nIn git docs, it’s reversed, they point to the thing on which they depend\n\n\nCheat sheet\nhttps://training.github.com/downloads/github-git-cheat-sheet.pdf"
+ "section": "Keep going",
+ "text": "Keep going\nBut that was only one data point \\((y_1, x_1)\\). Why stop there?\nDo the same with \\((y_2, x_2)\\)! Get an estimate \\(\\widehat{f}^{(2)}\\) without using it, then\n\\[\\widetilde{R}_1(\\widehat{f}^{(2)}) = (y_2 -\\widehat{f}^{(2)}(x_2))^2.\\]\nWe can keep doing this until we try it for every data point.\nAnd then average them! (Averages are good)\n\\[\\mbox{LOO-CV} = \\frac{1}{n}\\sum_{i=1}^n \\widetilde{R}_1(\\widehat{f}^{(i)}) = \\frac{1}{n}\\sum_{i=1}^n\n(y_i - \\widehat{f}^{(i)}(x_i))^2\\]\n\nThis is leave-one-out cross validation"
},
{
- "objectID": "schedule/slides/00-version-control.html#how-to-undo-in-3-scenarios",
- "href": "schedule/slides/00-version-control.html#how-to-undo-in-3-scenarios",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#problems-with-loo-cv",
+ "href": "schedule/slides/05-estimating-test-mse.html#problems-with-loo-cv",
"title": "UBC Stat406 2023W",
- "section": "How to undo in 3 scenarios",
- "text": "How to undo in 3 scenarios\n\nSuppose we’re concerned about a file named README.md\nOften, git status will give some of these as suggestions\n\n\n\n1. Saved but not staged\n\nIn RStudio, select the file and click then select Revert…\n\n# grab the previously committed version\ngit checkout -- README.md \n2. Staged but not committed\n\nIn RStudio, uncheck the box by the file, then use the method above.\n\n# unstage\ngit reset HEAD README.md\ngit checkout -- README.md\n\n\n3. Committed\n\nNot easy to do in RStudio…\n\n# check the log to see where you made the chg, \ngit log\n# go one step before that (eg to 32b252)\n# and grab that earlier version\ngit checkout 32b252 -- README.md\n\n# alternatively\n# if it happens to also be on another branch\ngit checkout otherbranch -- README.md"
+ "section": "Problems with LOO-CV",
+ "text": "Problems with LOO-CV\n🤮 Each held out set is small \\((n=1)\\). Therefore, the variance of the Squared Error of each prediction is high.\n🤮 The training sets overlap. This is bad.\n\nUsually, averaging reduces variance: \\(\\Var{\\overline{X}} = \\frac{1}{n^2}\\sum_{i=1}^n \\Var{X_i} = \\frac{1}{n}\\Var{X_1}.\\)\nBut only if the variables are independent. If not, then \\(\\Var{\\overline{X}} = \\frac{1}{n^2}\\Var{ \\sum_{i=1}^n X_i} = \\frac{1}{n}\\Var{X_1} + \\frac{1}{n^2}\\sum_{i\\neq j} \\Cov{X_i}{X_j}.\\)\nSince the training sets overlap a lot, that covariance can be pretty big.\n\n🤮 We have to estimate this model \\(n\\) times.\n🎉 Bias is low because we used almost all the data to fit the model: \\(E[\\mbox{LOO-CV}] = R_{n-1}\\)"
},
{
- "objectID": "schedule/slides/00-version-control.html#recovering-from-things",
- "href": "schedule/slides/00-version-control.html#recovering-from-things",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#k-fold-cv",
+ "href": "schedule/slides/05-estimating-test-mse.html#k-fold-cv",
"title": "UBC Stat406 2023W",
- "section": "Recovering from things",
- "text": "Recovering from things\n\nAccidentally did work on main, Tried to Push but got refused\n\n# make a new branch with everything, but stay on main\ngit branch newbranch\n# find out where to go to\ngit log\n# undo everything after ace2193\ngit reset --hard ace2193\ngit checkout newbranch\n\nMade a branch, did lots of work, realized it’s trash, and you want to burn it\n\ngit checkout main\ngit branch -d badbranch\n\nAnything more complicated, either post to Slack or LMGTFY\nIn the Lab next week, you’ll practice\n\nDoing it right.\nRecovering from some mistakes."
+ "section": "K-fold CV",
+ "text": "K-fold CV\n\n\nTo alleviate some of these problems, people usually use \\(K\\)-fold cross validation.\nThe idea of \\(K\\)-fold is\n\nDivide the data into \\(K\\) groups.\nLeave a group out and estimate with the rest.\nTest on the held-out group. Calculate an average risk over these \\(\\sim n/K\\) data.\nRepeat for all \\(K\\) groups.\nAverage the average risks.\n\n\n\n🎉 Less overlap, smaller covariance.\n🎉 Larger hold-out sets, smaller variance.\n🎉 Less computations (only need to estimate \\(K\\) times)\n🤮 LOO-CV is (nearly) unbiased for \\(R_n\\)\n🤮 K-fold CV is unbiased for \\(R_{n(1-1/K)}\\)\nThe risk depends on how much data you use to estimate the model. \\(R_n\\) depends on \\(n\\)."
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#meta-lecture",
- "href": "schedule/slides/00-quiz-0-wrap.html#meta-lecture",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#a-picture",
+ "href": "schedule/slides/05-estimating-test-mse.html#a-picture",
"title": "UBC Stat406 2023W",
- "section": "00 Quiz 0 fun",
- "text": "00 Quiz 0 fun\nStat 406\nDaniel J. McDonald\nLast modified – 13 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
+ "section": "A picture",
+ "text": "A picture\n\n\nCode\npar(mar = c(0, 0, 0, 0))\nplot(NA, NA, ylim = c(0, 5), xlim = c(0, 10), bty = \"n\", yaxt = \"n\", xaxt = \"n\")\nrect(0, .1 + c(0, 2, 3, 4), 10, .9 + c(0, 2, 3, 4), col = blue, density = 10)\nrect(c(0, 1, 2, 9), rev(.1 + c(0, 2, 3, 4)), c(1, 2, 3, 10), \n rev(.9 + c(0, 2, 3, 4)), col = red, density = 10)\npoints(c(5, 5, 5), 1 + 1:3 / 4, pch = 19)\ntext(.5 + c(0, 1, 2, 9), .5 + c(4, 3, 2, 0), c(\"1\", \"2\", \"3\", \"K\"), cex = 3, \n col = red)\ntext(6, 4.5, \"Training data\", cex = 3, col = blue)\ntext(2, 1.5, \"Validation data\", cex = 3, col = red)"
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#why-this-class",
- "href": "schedule/slides/00-quiz-0-wrap.html#why-this-class",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#code",
+ "href": "schedule/slides/05-estimating-test-mse.html#code",
"title": "UBC Stat406 2023W",
- "section": "Why this class?",
- "text": "Why this class?\n\nMost say requirements.\nInterest in ML/Stat learning\nExpressions of love/affection for Stats/CS/ML\nEnjoyment of past similar classes"
+ "section": "Code",
+ "text": "Code\n\n#' @param data The full data set\n#' @param estimator Function. Has 1 argument (some data) and fits a model. \n#' @param predictor Function. Has 2 args (the fitted model, the_newdata) and produces predictions\n#' @param error_fun Function. Has one arg: the test data, with fits added.\n#' @param kfolds Integer. The number of folds.\nkfold_cv <- function(data, estimator, predictor, error_fun, kfolds = 5) {\n n <- nrow(data)\n fold_labels <- sample(rep(1:kfolds, length.out = n))\n errors <- double(kfolds)\n for (fold in seq_len(kfolds)) {\n test_rows <- fold_labels == fold\n train <- data[!test_rows, ]\n test <- data[test_rows, ]\n current_model <- estimator(train)\n test$.preds <- predictor(current_model, test)\n errors[fold] <- error_fun(test)\n }\n mean(errors)\n}\n\n\n\nsomedata <- data.frame(z = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))\nest <- function(dataset) lm(z ~ ., data = dataset)\npred <- function(mod, dataset) predict(mod, newdata = dataset)\nerror_fun <- function(testdata) mutate(testdata, errs = (z - .preds)^2) |> pull(errs) |> mean()\nkfold_cv(somedata, est, pred, error_fun, 5)\n\n[1] 0.9532271"
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#why-this-class-1",
- "href": "schedule/slides/00-quiz-0-wrap.html#why-this-class-1",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#trick",
+ "href": "schedule/slides/05-estimating-test-mse.html#trick",
"title": "UBC Stat406 2023W",
- "section": "Why this class?",
- "text": "Why this class?\nMore idiosyncratic:\n\n\n“Professor received Phd from CMU, must be an awesome researcher.”\n“Learn strategies.”\n(paraphrase) “Course structure with less weight on exam helps with anxiety”\n(paraphrase) “I love coding in R and want more of it”\n“Emmmmmmmmmmmmmmmm, to learn some skills from Machine Learning and finish my minor🙃.”\n“destiny”\n“challenges from ChatGPT”\n“I thought Daniel Mcdonald is a cool prof…”\n“I have heard this is the most useful stat course in UBC.”"
+ "section": "Trick",
+ "text": "Trick\nFor a certain “nice” models, one can show\n(after pages of tedious algebra which I wouldn’t wish on my worst enemy, but might, in a fit of rage assign as homework to belligerent students)\n\\[\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(y_i -\\widehat{y}_i)^2}{(1-h_{ii})^2} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-h_{ii})^2}.\\]\n\nThis trick means that you only have to fit the model once rather than \\(n\\) times!\nYou still have to calculate this for each model!\n\n\ncv_nice <- function(mdl) mean( (residuals(mdl) / (1 - hatvalues(mdl)))^2 )"
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#syllabus-q",
- "href": "schedule/slides/00-quiz-0-wrap.html#syllabus-q",
+ "objectID": "schedule/slides/05-estimating-test-mse.html#trick-1",
+ "href": "schedule/slides/05-estimating-test-mse.html#trick-1",
"title": "UBC Stat406 2023W",
- "section": "Syllabus Q",
- "text": "Syllabus Q"
+ "section": "Trick",
+ "text": "Trick\n\ncv_nice <- function(mdl) mean( (residuals(mdl) / (1 - hatvalues(mdl)))^2 )\n\n“Nice” requires:\n\n\\(\\widehat{y}_i = h_i(\\mathbf{X})^\\top \\mathbf{y}\\) for some vector \\(h_i\\)\n\\(e^{(i)} = \\frac{\\widehat{e}_i}{(1-h_{ii})}\\)"
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#programming-languages",
- "href": "schedule/slides/00-quiz-0-wrap.html#programming-languages",
+ "objectID": "schedule/slides/03-regression-function.html#meta-lecture",
+ "href": "schedule/slides/03-regression-function.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Programming languages",
- "text": "Programming languages"
+ "section": "03 The regression function",
+ "text": "03 The regression function\nStat 406\nDaniel J. McDonald\nLast modified – 18 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#matrix-inversion",
- "href": "schedule/slides/00-quiz-0-wrap.html#matrix-inversion",
+ "objectID": "schedule/slides/03-regression-function.html#mean-squared-error-mse",
+ "href": "schedule/slides/03-regression-function.html#mean-squared-error-mse",
"title": "UBC Stat406 2023W",
- "section": "Matrix inversion",
- "text": "Matrix inversion\n\nlibrary(MASS)\nX <- matrix(c(5, 3, 1, -1), nrow = 2)\nX\n\n [,1] [,2]\n[1,] 5 1\n[2,] 3 -1\n\nsolve(X)\n\n [,1] [,2]\n[1,] 0.125 0.125\n[2,] 0.375 -0.625\n\nginv(X)\n\n [,1] [,2]\n[1,] 0.125 0.125\n[2,] 0.375 -0.625\n\nX^(-1)\n\n [,1] [,2]\n[1,] 0.2000000 1\n[2,] 0.3333333 -1"
+ "section": "Mean squared error (MSE)",
+ "text": "Mean squared error (MSE)\nLast time… Ordinary Least Squares\n\\[\\widehat\\beta = \\argmin_\\beta \\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.\\]\n“Find the \\(\\beta\\) which minimizes the sum of squared errors.”\n\\[\\widehat\\beta = \\arg\\min_\\beta \\frac{1}{n}\\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.\\]\n“Find the beta which minimizes the mean squared error.”"
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#linear-models",
- "href": "schedule/slides/00-quiz-0-wrap.html#linear-models",
+ "objectID": "schedule/slides/03-regression-function.html#forget-all-that",
+ "href": "schedule/slides/03-regression-function.html#forget-all-that",
"title": "UBC Stat406 2023W",
- "section": "Linear models",
- "text": "Linear models\n\ny <- X %*% c(2, -1) + rnorm(2)\ncoefficients(lm(y ~ X))\n\n(Intercept) X1 X2 \n 4.8953718 0.9380314 NA \n\ncoef(lm(y ~ X))\n\n(Intercept) X1 X2 \n 4.8953718 0.9380314 NA \n\nsolve(t(X) %*% X) %*% t(X) %*% y\n\n [,1]\n[1,] 2.161874\n[2,] -1.223843\n\nsolve(crossprod(X), crossprod(X, y))\n\n [,1]\n[1,] 2.161874\n[2,] -1.223843\n\n\n\nX \\ y # this is Matlab\n\nError: <text>:1:3: unexpected '\\\\'\n1: X \\\n ^"
+ "section": "Forget all that…",
+ "text": "Forget all that…\nThat’s “stuff that seems like a good idea”\nAnd it is for many reasons\nThis class is about those reasons, and the “statistics” behind it\n\n\n\nMethods for “Statistical” Learning\nStarts with “what is a model?”"
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#pets-and-plans",
- "href": "schedule/slides/00-quiz-0-wrap.html#pets-and-plans",
+ "objectID": "schedule/slides/03-regression-function.html#what-is-a-model",
+ "href": "schedule/slides/03-regression-function.html#what-is-a-model",
"title": "UBC Stat406 2023W",
- "section": "Pets and plans",
- "text": "Pets and plans"
+ "section": "What is a model?",
+ "text": "What is a model?\nIn statistics, “model” has a mathematical meaning.\nDistinct from “algorithm” or “procedure”.\nDefining a model often leads to a procedure/algorithm with good properties.\nSometimes procedure/algorithm \\(\\Rightarrow\\) a specific model.\n\nStatistics (the field) tells me how to understand when different procedures are desirable and the mathematical guarantees that they satisfy.\n\nWhen are certain models appropriate?\n\nOne definition of “Statistical Learning” is the “statistics behind the procedure”."
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#grade-predictions",
- "href": "schedule/slides/00-quiz-0-wrap.html#grade-predictions",
+ "objectID": "schedule/slides/03-regression-function.html#statistical-models-101",
+ "href": "schedule/slides/03-regression-function.html#statistical-models-101",
"title": "UBC Stat406 2023W",
- "section": "Grade predictions",
- "text": "Grade predictions\n\n\n4 people say 100%\n24 say 90%\n25 say 85%\n27 say 80%\nLots of clumping\n\n\n1 said 35, and 1 said 50. Woof!"
+ "section": "Statistical models 101",
+ "text": "Statistical models 101\nWe observe data \\(Z_1,\\ Z_2,\\ \\ldots,\\ Z_n\\) generated by some probability distribution \\(P\\). We want to use the data to learn about \\(P\\).\n\nA statistical model is a set of distributions \\(\\mathcal{P}\\).\n\nSome examples:\n\n\\(\\P = \\{ 0 < p < 1 : P(z=1)=p,\\ P(z=0)=1-p\\}\\).\n\\(\\P = \\{ \\beta \\in \\R^p,\\ \\sigma>0 : Y \\sim N(X^\\top\\beta,\\sigma^2),\\ X\\mbox{ fixed}\\}\\).\n\\(\\P = \\{\\mbox{all CDF's }F\\}\\).\n\\(\\P = \\{\\mbox{all smooth functions } f: \\R^p \\rightarrow \\R : Z_i = (X_i, Y_i),\\ E[Y_i] = f(X_i) \\}\\)"
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#prediction-accuracy-last-year",
- "href": "schedule/slides/00-quiz-0-wrap.html#prediction-accuracy-last-year",
+ "objectID": "schedule/slides/03-regression-function.html#statistical-models",
+ "href": "schedule/slides/03-regression-function.html#statistical-models",
"title": "UBC Stat406 2023W",
- "section": "Prediction accuracy (last year)",
- "text": "Prediction accuracy (last year)"
+ "section": "Statistical models",
+ "text": "Statistical models\nWe want to use the data to select a distribution \\(P\\) that probably generated the data.\n\nMy model:\n\\[\n\\P = \\{ P(z=1)=p,\\ P(z=0)=1-p,\\ 0 < p < 1 \\}\n\\]\n\nTo completely characterize \\(P\\), I just need to estimate \\(p\\).\nNeed to assume that \\(P \\in \\P\\).\nThis assumption is mostly empty: need independent, can’t see \\(z=12\\)."
},
{
- "objectID": "schedule/slides/00-quiz-0-wrap.html#prediction-accuracy-last-year-1",
- "href": "schedule/slides/00-quiz-0-wrap.html#prediction-accuracy-last-year-1",
+ "objectID": "schedule/slides/03-regression-function.html#statistical-models-1",
+ "href": "schedule/slides/03-regression-function.html#statistical-models-1",
"title": "UBC Stat406 2023W",
- "section": "Prediction accuracy (last year)",
- "text": "Prediction accuracy (last year)\n\nsummary(lm(actual ~ predicted - 1, data = acc))\n\n\nCall:\nlm(formula = actual ~ predicted - 1, data = acc)\n\nResiduals:\n Min 1Q Median 3Q Max \n-63.931 -2.931 1.916 6.052 21.217 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \npredicted 0.96590 0.01025 94.23 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 10.2 on 137 degrees of freedom\n (8 observations deleted due to missingness)\nMultiple R-squared: 0.9848, Adjusted R-squared: 0.9847 \nF-statistic: 8880 on 1 and 137 DF, p-value: < 2.2e-16\n\n\n\n\nUBC Stat 406 - 2023"
+ "section": "Statistical models",
+ "text": "Statistical models\nWe observe data \\(Z_i=(Y_i,X_i)\\) generated by some probability distribution \\(P\\). We want to use the data to learn about \\(P\\).\n\nMy model\n\\[\n\\P = \\{ \\beta \\in \\R^p, \\sigma>0 : Y_i \\given X_i=x_i \\sim N(x_i^\\top\\beta,\\ \\sigma^2) \\}.\n\\]\n\nTo completely characterize \\(P\\), I just need to estimate \\(\\beta\\) and \\(\\sigma\\).\nNeed to assume that \\(P\\in\\P\\).\nThis time, I have to assume a lot more: (conditional) Linearity, independence, conditional Gaussian noise, no ignored variables, no collinearity, etc."
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#meta-lecture",
- "href": "schedule/slides/00-gradient-descent.html#meta-lecture",
+ "objectID": "schedule/slides/03-regression-function.html#statistical-models-unfamiliar-example",
+ "href": "schedule/slides/03-regression-function.html#statistical-models-unfamiliar-example",
"title": "UBC Stat406 2023W",
- "section": "00 Gradient descent",
- "text": "00 Gradient descent\nStat 406\nDaniel J. McDonald\nLast modified – 25 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Statistical models, unfamiliar example",
+ "text": "Statistical models, unfamiliar example\nWe observe data \\(Z_i \\in \\R\\) generated by some probability distribution \\(P\\). We want to use the data to learn about \\(P\\).\nMy model\n\\[\n\\P = \\{ Z_i \\textrm{ has a density function } f \\}.\n\\]\n\nTo completely characterize \\(P\\), I need to estimate \\(f\\).\nIn fact, we can’t hope to do this.\n\nRevised Model 1 - \\(\\P=\\{ Z_i \\textrm{ has a density function } f : \\int (f'')^2 dx < M \\}\\)\nRevised Model 2 - \\(\\P=\\{ Z_i \\textrm{ has a density function } f : \\int (f'')^2 dx < K < M \\}\\)\nRevised Model 3 - \\(\\P=\\{ Z_i \\textrm{ has a density function } f : \\int |f'| dx < M \\}\\)\n\nEach of these suggests different ways of estimating \\(f\\)"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#simple-optimization-techniques",
- "href": "schedule/slides/00-gradient-descent.html#simple-optimization-techniques",
+ "objectID": "schedule/slides/03-regression-function.html#assumption-lean-regression",
+ "href": "schedule/slides/03-regression-function.html#assumption-lean-regression",
"title": "UBC Stat406 2023W",
- "section": "Simple optimization techniques",
- "text": "Simple optimization techniques\nWe’ll see “gradient descent” a few times:\n\nsolves logistic regression (simple version of IRWLS)\ngradient boosting\nNeural networks\n\nThis seems like a good time to explain it.\nSo what is it and how does it work?"
+ "section": "Assumption Lean Regression",
+ "text": "Assumption Lean Regression\nImagine \\(Z = (Y, \\mathbf{X}) \\sim P\\) with \\(Y \\in \\R\\) and \\(\\mathbf{X} = (1, X_1, \\ldots, X_p)^\\top\\).\nWe are interested in the conditional distribution \\(P_{Y|\\mathbf{X}}\\)\nSuppose we think that there is some function of interest which relates \\(Y\\) and \\(X\\).\nLet’s call this function \\(\\mu(\\mathbf{X})\\) for the moment. How do we estimate \\(\\mu\\)? What is \\(\\mu\\)?\n\n\nTo make this precise, we\n\nHave a model \\(\\P\\).\nNeed to define a “good” functional \\(\\mu\\).\nLet’s loosely define “good” as\n\n\nGiven a new (random) \\(Z\\), \\(\\mu(\\mathbf{X})\\) is “close” to \\(Y\\).\n\n\n\nSee Berk et al. Assumption Lean Regression."
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#very-basic-example",
- "href": "schedule/slides/00-gradient-descent.html#very-basic-example",
+ "objectID": "schedule/slides/03-regression-function.html#evaluating-close",
+ "href": "schedule/slides/03-regression-function.html#evaluating-close",
"title": "UBC Stat406 2023W",
- "section": "Very basic example",
- "text": "Very basic example\n\n\nSuppose I want to minimize \\(f(x)=(x-6)^2\\) numerically.\nI start at a point (say \\(x_1=23\\))\nI want to “go” in the negative direction of the gradient.\nThe gradient (at \\(x_1=23\\)) is \\(f'(23)=2(23-6)=34\\).\nMove current value toward current value - 34.\n\\(x_2 = x_1 - \\gamma 34\\), for \\(\\gamma\\) small.\nIn general, \\(x_{n+1} = x_n -\\gamma f'(x_n)\\).\n\nniter <- 10\ngam <- 0.1\nx <- double(niter)\nx[1] <- 23\ngrad <- function(x) 2 * (x - 6)\nfor (i in 2:niter) x[i] <- x[i - 1] - gam * grad(x[i - 1])"
+ "section": "Evaluating “close”",
+ "text": "Evaluating “close”\nWe need more functions.\nChoose some loss function \\(\\ell\\) that measures how close \\(\\mu\\) and \\(Y\\) are.\n\n\n\nSquared-error:\n\\(\\ell(y,\\ \\mu) = (y-\\mu)^2\\)\nAbsolute-error:\n\\(\\ell(y,\\ \\mu) = |y-\\mu|\\)\nZero-One:\n\\(\\ell(y,\\ \\mu) = I(y\\neq\\mu)=\\begin{cases} 0 & y=\\mu\\\\1 & \\mbox{else}\\end{cases}\\)\nCauchy:\n\\(\\ell(y,\\ \\mu) = \\log(1 + (y - \\mu)^2)\\)\n\n\n\n\n\nCode\nggplot() +\n xlim(-2, 2) +\n geom_function(fun = ~log(1+.x^2), colour = 'purple', linewidth = 2) +\n geom_function(fun = ~.x^2, colour = tertiary, linewidth = 2) +\n geom_function(fun = ~abs(.x), colour = primary, linewidth = 2) +\n geom_line(\n data = tibble(x = seq(-2, 2, length.out = 100), y = as.numeric(x != 0)), \n aes(x, y), colour = orange, linewidth = 2) +\n geom_point(data = tibble(x = 0, y = 0), aes(x, y), \n colour = orange, pch = 16, size = 3) +\n ylab(bquote(\"\\u2113\" * (y - mu))) + xlab(bquote(y - mu))"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#why-does-this-work",
- "href": "schedule/slides/00-gradient-descent.html#why-does-this-work",
+ "objectID": "schedule/slides/03-regression-function.html#start-with-expected-squared-error",
+ "href": "schedule/slides/03-regression-function.html#start-with-expected-squared-error",
"title": "UBC Stat406 2023W",
- "section": "Why does this work?",
- "text": "Why does this work?\nHeuristic interpretation:\n\nGradient tells me the slope.\nnegative gradient points toward the minimum\ngo that way, but not too far (or we’ll miss it)"
+ "section": "Start with (Expected) Squared Error",
+ "text": "Start with (Expected) Squared Error\nLet’s try to minimize the expected squared error (MSE).\nClaim: \\(\\mu(X) = \\Expect{Y\\ \\vert\\ X}\\) minimizes MSE.\nThat is, for any \\(r(X)\\), \\(\\Expect{(Y - \\mu(X))^2} \\leq \\Expect{(Y-r(X))^2}\\).\n\nProof of Claim:\n\\[\\begin{aligned}\n\\Expect{(Y-r(X))^2}\n&= \\Expect{(Y- \\mu(X) + \\mu(X) - r(X))^2}\\\\\n&= \\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} \\\\\n&\\quad +2\\Expect{(Y- \\mu(X))(\\mu(X) - r(X))}\\\\\n&=\\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} \\\\\n&\\quad +2(\\mu(X) - r(X))\\Expect{(Y- \\mu(X))}\\\\\n&=\\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} + 0\\\\\n&\\geq \\Expect{(Y- \\mu(X))^2}\n\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#why-does-this-work-1",
- "href": "schedule/slides/00-gradient-descent.html#why-does-this-work-1",
+ "objectID": "schedule/slides/03-regression-function.html#the-regression-function",
+ "href": "schedule/slides/03-regression-function.html#the-regression-function",
"title": "UBC Stat406 2023W",
- "section": "Why does this work?",
- "text": "Why does this work?\nMore rigorous interpretation:\n\nTaylor expansion \\[\nf(x) \\approx f(x_0) + \\nabla f(x_0)^{\\top}(x-x_0) + \\frac{1}{2}(x-x_0)^\\top H(x_0) (x-x_0)\n\\]\nreplace \\(H\\) with \\(\\gamma^{-1} I\\)\nminimize this quadratic approximation in \\(x\\): \\[\n0\\overset{\\textrm{set}}{=}\\nabla f(x_0) + \\frac{1}{\\gamma}(x-x_0) \\Longrightarrow x = x_0 - \\gamma \\nabla f(x_0)\n\\]"
+ "section": "The regression function",
+ "text": "The regression function\nSometimes people call this solution:\n\\[\\mu(X) = \\Expect{Y \\ \\vert\\ X}\\]\nthe regression function. (But don’t forget that it depended on \\(\\ell\\).)\nIf we assume that \\(\\mu(x) = \\Expect{Y \\ \\vert\\ X=x} = x^\\top \\beta\\), then we get back exactly OLS.\n\nBut why should we assume \\(\\mu(x) = x^\\top \\beta\\)?"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#visually",
- "href": "schedule/slides/00-gradient-descent.html#visually",
+ "objectID": "schedule/slides/03-regression-function.html#brief-aside",
+ "href": "schedule/slides/03-regression-function.html#brief-aside",
"title": "UBC Stat406 2023W",
- "section": "Visually",
- "text": "Visually"
+ "section": "Brief aside",
+ "text": "Brief aside\nSome notation / terminology\n\n“Hats” on things mean “estimates”, so \\(\\widehat{\\mu}\\) is an estimate of \\(\\mu\\)\nParameters are “properties of the model”, so \\(f_X(x)\\) or \\(\\mu\\) or \\(\\Var{Y}\\)\nRandom variables like \\(X\\), \\(Y\\), \\(Z\\) may eventually become data, \\(x\\), \\(y\\), \\(z\\), once observed.\n“Estimating” means “using observations to estimate parameters”\n“Predicting” means “using observations to predict future data”\nOften, there is a parameter whose estimate will provide a prediction.\n\n\n\nThis last point can lead to confusion."
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#visually-1",
- "href": "schedule/slides/00-gradient-descent.html#visually-1",
+ "objectID": "schedule/slides/03-regression-function.html#the-regression-function-1",
+ "href": "schedule/slides/03-regression-function.html#the-regression-function-1",
"title": "UBC Stat406 2023W",
- "section": "Visually",
- "text": "Visually"
+ "section": "The regression function",
+ "text": "The regression function\nIn mathematics: \\(\\mu(x) = \\Expect{Y \\ \\vert\\ X=x}\\).\nIn words:\nRegression with squared-error loss is really about estimating the (conditional) mean.\n\nIf \\(Y\\sim \\textrm{N}(\\mu,\\ 1)\\), our best guess for a new \\(Y\\) is \\(\\mu\\).\nFor regression, we let the mean \\((\\mu)\\) depend on \\(X\\).\n\nThink of \\(Y\\sim \\textrm{N}(\\mu(X),\\ 1)\\), then conditional on \\(X=x\\), our best guess for a new \\(Y\\) is \\(\\mu(x)\\)\n\n[whatever this function \\(\\mu\\) is]"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#what-gamma-more-details-than-we-have-time-for",
- "href": "schedule/slides/00-gradient-descent.html#what-gamma-more-details-than-we-have-time-for",
+ "objectID": "schedule/slides/03-regression-function.html#anything-strange",
+ "href": "schedule/slides/03-regression-function.html#anything-strange",
"title": "UBC Stat406 2023W",
- "section": "What \\(\\gamma\\)? (more details than we have time for)",
- "text": "What \\(\\gamma\\)? (more details than we have time for)\nWhat to use for \\(\\gamma_k\\)?\nFixed\n\nOnly works if \\(\\gamma\\) is exactly right\nUsually does not work\n\nDecay on a schedule\n\\(\\gamma_{n+1} = \\frac{\\gamma_n}{1+cn}\\) or \\(\\gamma_{n} = \\gamma_0 b^n\\)\nExact line search\n\nTells you exactly how far to go.\nAt each iteration \\(n\\), solve \\(\\gamma_n = \\arg\\min_{s \\geq 0} f( x^{(n)} - s f(x^{(n-1)}))\\)\nUsually can’t solve this."
+ "section": "Anything strange?",
+ "text": "Anything strange?\nFor any two variables \\(Y\\) and \\(X\\), we can always write\n\\[Y = E[Y\\given X] + (Y - E[Y\\given X]) = \\mu(X) + \\eta(X)\\]\nsuch that \\(\\Expect{\\eta(X)}=0\\).\n\n\nSuppose, \\(\\mu(X)=\\mu_0\\) (constant in \\(X\\)), are \\(Y\\) and \\(X\\) independent?\n\n\n\n\nSuppose \\(Y\\) and \\(X\\) are independent, is \\(\\mu(X)=\\mu_0\\)?\n\n\n\n\nFor more practice on this see the Fun Worksheet on Theory and solutions\nIn this course, I do not expect you to be able to create this math, but understanding and explaining it is important."
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#section",
- "href": "schedule/slides/00-gradient-descent.html#section",
+ "objectID": "schedule/slides/03-regression-function.html#what-do-we-mean-by-good-predictions",
+ "href": "schedule/slides/03-regression-function.html#what-do-we-mean-by-good-predictions",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "\\[ f(x_1,x_2) = x_1^2 + 0.5x_2^2\\]\n\nx <- matrix(0, 40, 2); x[1, ] <- c(1, 1)\ngrad <- function(x) c(2, 1) * x"
+ "section": "What do we mean by good predictions?",
+ "text": "What do we mean by good predictions?\nWe make observations and then attempt to “predict” new, unobserved data.\nSometimes this is the same as estimating the (conditional) mean.\nMostly, we observe \\((y_1,x_1),\\ \\ldots,\\ (y_n,x_n)\\), and we want some way to predict \\(Y\\) from \\(X\\)."
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#section-1",
- "href": "schedule/slides/00-gradient-descent.html#section-1",
+ "objectID": "schedule/slides/03-regression-function.html#expected-test-mse",
+ "href": "schedule/slides/03-regression-function.html#expected-test-mse",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "\\[ f(x_1,x_2) = x_1^2 + 0.5x_2^2\\]\n\ngamma <- .1\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])"
+ "section": "Expected test MSE",
+ "text": "Expected test MSE\nFor regression applications, we will use squared-error loss:\n\\(R_n(\\widehat{\\mu}) = \\Expect{(Y-\\widehat{\\mu}(X))^2}\\)\n\nI’m giving this a name, \\(R_n\\) for ease.\nDifferent than text.\nThis is expected test MSE."
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#section-2",
- "href": "schedule/slides/00-gradient-descent.html#section-2",
+ "objectID": "schedule/slides/03-regression-function.html#example-estimatingpredicting-the-conditional-mean",
+ "href": "schedule/slides/03-regression-function.html#example-estimatingpredicting-the-conditional-mean",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "\\[ f(x_1,x_2) = x_1^2 + 0.5x_2^2\\]\n\ngamma <- .9 # bigger gamma\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])"
+ "section": "Example: Estimating/Predicting the (conditional) mean",
+ "text": "Example: Estimating/Predicting the (conditional) mean\nSuppose we know that we want to predict a quantity \\(Y\\),\nwhere \\(\\Expect{Y}= \\mu \\in \\mathbb{R}\\) and \\(\\Var{Y} = 1\\).\nOur data is \\(\\{y_1,\\ldots,y_n\\}\\)\nClaim: We want to estimate \\(\\mu\\).\n\nWhy?"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#section-3",
- "href": "schedule/slides/00-gradient-descent.html#section-3",
+ "objectID": "schedule/slides/03-regression-function.html#estimating-the-mean",
+ "href": "schedule/slides/03-regression-function.html#estimating-the-mean",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "\\[ f(x_1,x_2) = x_1^2 + 0.5x_2^2\\]\n\ngamma <- .9 # big, but decrease it on schedule\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * .9^k * grad(x[k - 1, ])"
+ "section": "Estimating the mean",
+ "text": "Estimating the mean\n\nLet \\(\\widehat{Y}=\\overline{Y}_n\\) be the sample mean.\n\nWe can ask about the estimation risk (since we’re estimating \\(\\mu\\)):\n\n\n\n\\[\\begin{aligned}\n E[(\\overline{Y}_n-\\mu)^2]\n &= E[\\overline{Y}_n^2]\n -2\\mu E[\\overline{Y}_n] + \\mu^2 \\\\\n &= \\mu^2 + \\frac{1}{n} - 2\\mu^2 +\n \\mu^2\\\\ &= \\frac{1}{n}\n\\end{aligned}\\]\n\n\nUseful trick\nFor any \\(Z\\),\n\\(\\Var{Z} = \\Expect{Z^2} - \\Expect{Z}^2\\).\nTherefore:\n\\(\\Expect{Z^2} = \\Var{Z} + \\Expect{Z}^2\\)."
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#section-4",
- "href": "schedule/slides/00-gradient-descent.html#section-4",
+ "objectID": "schedule/slides/03-regression-function.html#predicting-new-ys",
+ "href": "schedule/slides/03-regression-function.html#predicting-new-ys",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "\\[ f(x_1,x_2) = x_1^2 + 0.5x_2^2\\]\n\ngamma <- .5 # theoretically optimal\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])"
+ "section": "Predicting new Y’s",
+ "text": "Predicting new Y’s\n\nLet \\(\\widehat{Y}=\\overline{Y}_n\\) be the sample mean.\n\nWhat is the prediction risk of \\(\\overline{Y}\\)?\n\n\n\n\\[\\begin{aligned}\n R_n(\\overline{Y}_n)\n &= \\E[(\\overline{Y}_n-Y)^2]\\\\\n &= \\E[\\overline{Y}_{n}^{2}] -2\\E[\\overline{Y}_n Y] + \\E[Y^2] \\\\\n &= \\mu^2 + \\frac{1}{n} - 2\\mu^2 + \\mu^2 + 1 \\\\\n &= 1 + \\frac{1}{n}\n\\end{aligned}\\]\n\n\nTricks:\nUsed the variance thing again.\nIf \\(X\\) and \\(Z\\) are independent, then \\(\\Expect{XZ} = \\Expect{X}\\Expect{Z}\\)"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#when-do-we-stop",
- "href": "schedule/slides/00-gradient-descent.html#when-do-we-stop",
+ "objectID": "schedule/slides/03-regression-function.html#predicting-new-ys-1",
+ "href": "schedule/slides/03-regression-function.html#predicting-new-ys-1",
"title": "UBC Stat406 2023W",
- "section": "When do we stop?",
- "text": "When do we stop?\nFor \\(\\epsilon>0\\), small\nCheck any / all of\n\n\\(|f'(x)| < \\epsilon\\)\n\\(|x^{(k)} - x^{(k-1)}| < \\epsilon\\)\n\\(|f(x^{(k)}) - f(x^{(k-1)})| < \\epsilon\\)"
+ "section": "Predicting new Y’s",
+ "text": "Predicting new Y’s\n\nWhat is the prediction risk of guessing \\(Y=0\\)?\nYou can probably guess that this is a stupid idea.\nLet’s show why it’s stupid.\n\n\\[\\begin{aligned}\n R_n(0) &= \\E[(0-Y)^2] = 1 + \\mu^2\n\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#stochastic-gradient-descent",
- "href": "schedule/slides/00-gradient-descent.html#stochastic-gradient-descent",
+ "objectID": "schedule/slides/03-regression-function.html#predicting-new-ys-2",
+ "href": "schedule/slides/03-regression-function.html#predicting-new-ys-2",
"title": "UBC Stat406 2023W",
- "section": "Stochastic gradient descent",
- "text": "Stochastic gradient descent\nSuppose \\(f(x) = \\frac{1}{n}\\sum_{i=1}^n f_i(x)\\)\nLike if \\(f(\\beta) = \\frac{1}{n}\\sum_{i=1}^n (y_i - x^\\top_i\\beta)^2\\).\nThen \\(f'(\\beta) = \\frac{1}{n}\\sum_{i=1}^n f'_i(\\beta) = \\frac{1}{n} \\sum_{i=1}^n -2x_i^\\top(y_i - x^\\top_i\\beta)\\)\nIf \\(n\\) is really big, it may take a long time to compute \\(f'\\)\nSo, just sample some partition our data into mini-batches \\(\\mathcal{M}_j\\)\nAnd approximate (imagine the Law of Large Numbers, use a sample to approximate the population) \\[f'(x) = \\frac{1}{n}\\sum_{i=1}^n f'_i(x) \\approx \\frac{1}{m}\\sum_{i\\in\\mathcal{M}_j}f'_{i}(x)\\]"
+ "section": "Predicting new Y’s",
+ "text": "Predicting new Y’s\n\nWhat is the prediction risk of guessing \\(Y=\\mu\\)?\nThis is a great idea, but we don’t know \\(\\mu\\).\nLet’s see what happens anyway.\n\n\\[\\begin{aligned}\n R_n(\\mu) &= \\E[(Y-\\mu)^2]= 1\n\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#sgd",
- "href": "schedule/slides/00-gradient-descent.html#sgd",
+ "objectID": "schedule/slides/03-regression-function.html#risk-relations",
+ "href": "schedule/slides/03-regression-function.html#risk-relations",
"title": "UBC Stat406 2023W",
- "section": "SGD",
- "text": "SGD\n\\[\n\\begin{aligned}\nf'(\\beta) &= \\frac{1}{n}\\sum_{i=1}^n f'_i(\\beta) = \\frac{1}{n} \\sum_{i=1}^n -2x_i^\\top(y_i - x^\\top_i\\beta)\\\\\nf'(x) &= \\frac{1}{n}\\sum_{i=1}^n f'_i(x) \\approx \\frac{1}{m}\\sum_{i\\in\\mathcal{M}_j}f'_{i}(x)\n\\end{aligned}\n\\]\nUsually cycle through “mini-batches”:\n\nUse a different mini-batch at each iteration of GD\nCycle through until we see all the data\n\nThis is the workhorse for neural network optimization"
+ "section": "Risk relations",
+ "text": "Risk relations\nPrediction risk: \\(R_n(\\overline{Y}_n) = 1 + \\frac{1}{n}\\)\nEstimation risk: \\(E[(\\overline{Y}_n - \\mu)^2] = \\frac{1}{n}\\)\nThere is actually a nice interpretation here:\n\nThe common \\(1/n\\) term is \\(\\Var{\\overline{Y}_n}\\)\n\nThe extra factor of \\(1\\) in the prediction risk is irreducible error\n\n\\(Y\\) is a random variable, and hence noisy.\nWe can never eliminate it’s intrinsic variance.\n\nIn other words, even if we knew \\(\\mu\\), we could never get closer than \\(1\\), on average.\n\n\nIntuitively, \\(\\overline{Y}_n\\) is the obvious thing to do.\nBut what about unintuitive things…"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#gradient-descent-for-logistic-regression",
- "href": "schedule/slides/00-gradient-descent.html#gradient-descent-for-logistic-regression",
+ "objectID": "schedule/slides/01-lm-review.html#meta-lecture",
+ "href": "schedule/slides/01-lm-review.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Gradient descent for Logistic regression",
- "text": "Gradient descent for Logistic regression\nSuppose \\(Y=1\\) with probability \\(p(x)\\) and \\(Y=0\\) with probability \\(1-p(x)\\), \\(x \\in \\R\\).\nI want to model \\(P(Y=1| X=x)\\).\nI’ll assume that \\(\\log\\left(\\frac{p(x)}{1-p(x)}\\right) = ax\\) for some scalar \\(a\\). This means that \\(p(x) = \\frac{\\exp(ax)}{1+\\exp(ax)} = \\frac{1}{1+\\exp(-ax)}\\)\n\n\n\nn <- 100\na <- 2\nx <- runif(n, -5, 5)\nlogit <- function(x) 1 / (1 + exp(-x))\np <- logit(a * x)\ny <- rbinom(n, 1, p)\ndf <- tibble(x, y)\nggplot(df, aes(x, y)) +\n geom_point(colour = \"cornflowerblue\") +\n stat_function(fun = ~ logit(a * .x))"
+ "section": "01 Linear model review",
+ "text": "01 Linear model review\nStat 406\nDaniel J. McDonald\nLast modified – 30 August 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\]"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood",
- "href": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood",
+ "objectID": "schedule/slides/01-lm-review.html#the-normal-linear-model",
+ "href": "schedule/slides/01-lm-review.html#the-normal-linear-model",
"title": "UBC Stat406 2023W",
- "section": "Reminder: the likelihood",
- "text": "Reminder: the likelihood\n\\[\nL(y | a, x) = \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n\\]\n\\[\n\\begin{aligned}\n\\ell(y | a, x) &= \\log \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\n= \\sum_{i=1}^n y_i\\log p(x_i) + (1-y_i)\\log(1-p(x_i))\\\\\n&= \\sum_{i=1}^n\\log(1-p(x_i)) + y_i\\log\\left(\\frac{p(x_i)}{1-p(x_i)}\\right)\\\\\n&=\\sum_{i=1}^n ax_i y_i + \\log\\left(1-p(x_i)\\right)\\\\\n&=\\sum_{i=1}^n ax_i y_i + \\log\\left(\\frac{1}{1+\\exp(ax_i)}\\right)\n\\end{aligned}\n\\]"
+ "section": "The normal linear model",
+ "text": "The normal linear model\nAssume that\n\\[\ny_i = x_i^\\top \\beta + \\epsilon_i.\n\\]\n\n\nWhat is the mean of \\(y_i\\)?\nWhat is the distribution of \\(\\epsilon_i\\)?\nWhat is the notation \\(\\mathbf{X}\\) or \\(\\mathbf{y}\\)?"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-1",
- "href": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-1",
+ "objectID": "schedule/slides/01-lm-review.html#drawing-a-sample",
+ "href": "schedule/slides/01-lm-review.html#drawing-a-sample",
"title": "UBC Stat406 2023W",
- "section": "Reminder: the likelihood",
- "text": "Reminder: the likelihood\n\\[\nL(y | a, x) = \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n\\]\nNow, we want the negative of this. Why?\nWe would maximize the likelihood/log-likelihood, so we minimize the negative likelihood/log-likelihood (and scale by \\(1/n\\))\n\\[-\\ell(y | a, x) = \\frac{1}{n}\\sum_{i=1}^n -ax_i y_i - \\log\\left(\\frac{1}{1+\\exp(ax_i)}\\right)\\]"
+ "section": "Drawing a sample",
+ "text": "Drawing a sample\n\\[\ny_i = x_i^\\top \\beta + \\epsilon_i.\n\\]\nHow would I create data from this model (draw a sample)?\n\nSet up parameters\n\np <- 3\nn <- 100\nsigma <- 2\n\n\n\nCreate the data\n\nepsilon <- rnorm(n, sd = sigma) # this is random\nX <- matrix(runif(n * p), n, p) # treat this as fixed, but I need numbers\nbeta <- (p + 1):1 # parameter, also fixed, but I again need numbers\nY <- cbind(1, X) %*% beta + epsilon # epsilon is random, so this is\n## Equiv: Y <- beta[1] + X %*% beta[-1] + epsilon"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-2",
- "href": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-2",
+ "objectID": "schedule/slides/01-lm-review.html#how-do-we-estimate-beta",
+ "href": "schedule/slides/01-lm-review.html#how-do-we-estimate-beta",
"title": "UBC Stat406 2023W",
- "section": "Reminder: the likelihood",
- "text": "Reminder: the likelihood\n\\[\n\\frac{1}{n}L(y | a, x) = \\frac{1}{n}\\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n\\]\nThis is, in the notation of our slides \\(f(a)\\).\nWe want to minimize it in \\(a\\) by gradient descent.\nSo we need the derivative with respect to \\(a\\): \\(f'(a)\\).\nNow, conveniently, this simplifies a lot.\n\\[\n\\begin{aligned}\n\\frac{d}{d a} f(a) &= \\frac{1}{n}\\sum_{i=1}^n -x_i y_i - \\left(-\\frac{x_i \\exp(ax_i)}{1+\\exp(ax_i)}\\right)\\\\\n&=\\frac{1}{n}\\sum_{i=1}^n -x_i y_i + p(x_i)x_i = \\frac{1}{n}\\sum_{i=1}^n -x_i(y_i-p(x_i)).\n\\end{aligned}\n\\]"
+ "section": "How do we estimate beta?",
+ "text": "How do we estimate beta?\n\nGuess.\nOrdinary least squares (OLS).\nMaximum likelihood.\nDo something more creative."
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-3",
- "href": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-3",
+ "objectID": "schedule/slides/01-lm-review.html#method-2.-ols",
+ "href": "schedule/slides/01-lm-review.html#method-2.-ols",
"title": "UBC Stat406 2023W",
- "section": "Reminder: the likelihood",
- "text": "Reminder: the likelihood\n\\[\n\\frac{1}{n}L(y | a, x) = \\frac{1}{n}\\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n\\]\n(Simple) gradient descent to minimize \\(-\\ell(a)\\) or maximize \\(L(y|a,x)\\) is:\n\nInput \\(a_1,\\ \\gamma>0,\\ j_\\max,\\ \\epsilon>0,\\ \\frac{d}{da} -\\ell(a)\\).\nFor \\(j=1,\\ 2,\\ \\ldots,\\ j_\\max\\), \\[a_j = a_{j-1} - \\gamma \\frac{d}{da} (-\\ell(a_{j-1}))\\]\nStop if \\(\\epsilon > |a_j - a_{j-1}|\\) or \\(|d / da\\ \\ell(a)| < \\epsilon\\)."
+ "section": "Method 2. OLS",
+ "text": "Method 2. OLS\nI want to find an estimator \\(\\widehat\\beta\\) that makes small errors on my data.\nI measure errors with the difference between predictions \\(\\mathbf{X}\\widehat\\beta\\) and the responses \\(\\mathbf{y}\\).\n\n\nDon’t care if the differences are positive or negative\n\\[\\sum_{i=1}^n \\left\\lvert y_i - x_i^\\top \\widehat\\beta \\right\\rvert.\\]\nThis is hard to minimize (what is the derivative of \\(|\\cdot|\\)?)\n\\[\\sum_{i=1}^n ( y_i - x_i^\\top \\widehat\\beta )^2.\\]"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-4",
- "href": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-4",
+ "objectID": "schedule/slides/01-lm-review.html#method-2.-ols-solution",
+ "href": "schedule/slides/01-lm-review.html#method-2.-ols-solution",
"title": "UBC Stat406 2023W",
- "section": "Reminder: the likelihood",
- "text": "Reminder: the likelihood\n\\[\n\\frac{1}{n}L(y | a, x) = \\frac{1}{n}\\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n\\]\n\namle <- function(x, y, a0, gam = 0.5, jmax = 50, eps = 1e-6) {\n a <- double(jmax) # place to hold stuff (always preallocate space)\n a[1] <- a0 # starting value\n for (j in 2:jmax) { # avoid possibly infinite while loops\n px <- logit(a[j - 1] * x)\n grad <- mean(-x * (y - px))\n a[j] <- a[j - 1] - gam * grad\n if (abs(grad) < eps || abs(a[j] - a[j - 1]) < eps) break\n }\n a[1:j]\n}"
+ "section": "Method 2. OLS solution",
+ "text": "Method 2. OLS solution\nWe write this as\n\\[\\widehat\\beta = \\argmin_\\beta \\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.\\]\n\nFind the \\(\\beta\\) which minimizes the sum of squared errors.\n\n\nNote that this is the same as\n\\[\\widehat\\beta = \\argmin_\\beta \\frac{1}{n}\\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.\\]\n\nFind the beta which minimizes the mean squared error."
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#try-it",
- "href": "schedule/slides/00-gradient-descent.html#try-it",
+ "objectID": "schedule/slides/01-lm-review.html#method-2.-ok-do-it",
+ "href": "schedule/slides/01-lm-review.html#method-2.-ok-do-it",
"title": "UBC Stat406 2023W",
- "section": "Try it:",
- "text": "Try it:\n\nround(too_big <- amle(x, y, 5, 50), 3)\n\n [1] 5.000 3.360 2.019 1.815 2.059 1.782 2.113 1.746 2.180 1.711 2.250 1.684\n[13] 2.309 1.669 2.344 1.663 2.359 1.661 2.364 1.660 2.365 1.660 2.366 1.660\n[25] 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660\n[37] 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660\n[49] 2.366 1.660\n\nround(too_small <- amle(x, y, 5, 1), 3)\n\n [1] 5.000 4.967 4.934 4.902 4.869 4.837 4.804 4.772 4.739 4.707 4.675 4.643\n[13] 4.611 4.579 4.547 4.515 4.483 4.451 4.420 4.388 4.357 4.326 4.294 4.263\n[25] 4.232 4.201 4.170 4.140 4.109 4.078 4.048 4.018 3.988 3.957 3.927 3.898\n[37] 3.868 3.838 3.809 3.779 3.750 3.721 3.692 3.663 3.635 3.606 3.578 3.550\n[49] 3.522 3.494\n\nround(just_right <- amle(x, y, 5, 10), 3)\n\n [1] 5.000 4.672 4.351 4.038 3.735 3.445 3.171 2.917 2.688 2.488 2.322 2.191\n[13] 2.094 2.027 1.983 1.956 1.940 1.930 1.925 1.922 1.920 1.919 1.918 1.918\n[25] 1.918 1.918 1.918 1.917 1.917 1.917 1.917"
+ "section": "Method 2. Ok, do it",
+ "text": "Method 2. Ok, do it\nWe differentiate and set to zero\n\\[\\begin{aligned}\n& \\frac{\\partial}{\\partial \\beta} \\frac{1}{n}\\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2\\\\\n&= -\\frac{2}{n}\\sum_{i=1}^n x_i (y_i - x_i^\\top\\beta)\\\\\n&= \\frac{2}{n}\\sum_{i=1}^n x_i x_i^\\top \\beta - x_i y_i\\\\\n0 &\\equiv \\sum_{i=1}^n x_i x_i^\\top \\beta - x_i y_i\\\\\n&\\Rightarrow \\sum_{i=1}^n x_i x_i^\\top \\beta = \\sum_{i=1}^n x_i y_i\\\\\n&\\Rightarrow \\beta = \\left(\\sum_{i=1}^n x_i x_i^\\top\\right)^{-1}\\sum_{i=1}^n x_i y_i\n\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/00-gradient-descent.html#visual",
- "href": "schedule/slides/00-gradient-descent.html#visual",
+ "objectID": "schedule/slides/01-lm-review.html#in-matrix-notation",
+ "href": "schedule/slides/01-lm-review.html#in-matrix-notation",
"title": "UBC Stat406 2023W",
- "section": "Visual",
- "text": "Visual\n\n\nnegll <- function(a) {\n -a * mean(x * y) -\n rowMeans(log(1 / (1 + exp(outer(a, x)))))\n}\nblah <- list_rbind(\n map(\n rlang::dots_list(\n too_big, too_small, just_right, .named = TRUE\n ), \n as_tibble),\n names_to = \"gamma\"\n) |> mutate(negll = negll(value))\nggplot(blah, aes(value, negll)) +\n geom_point(aes(colour = gamma)) +\n facet_wrap(~gamma, ncol = 1) +\n stat_function(fun = negll, xlim = c(-2.5, 5)) +\n scale_y_log10() + \n xlab(\"a\") + \n ylab(\"negative log likelihood\") +\n geom_vline(xintercept = tail(just_right, 1)) +\n scale_colour_brewer(palette = \"Set1\") +\n theme(legend.position = \"none\")"
+ "section": "In matrix notation…",
+ "text": "In matrix notation…\n…this is\n\\[\\hat\\beta = ( \\mathbf{X}^\\top \\mathbf{X})^{-1} \\mathbf{X}^\\top\\mathbf{y}.\\]\nThe \\(\\beta\\) which “minimizes the sum of squared errors”\nAKA, the SSE."
},
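To make the closed form concrete, here is a minimal R sketch (not from the slides): it simulates data the same way as the “Drawing a sample” slide and checks the by-hand (X'X)^{-1} X'y against lm(). The seed is an arbitrary choice.
set.seed(406)                                   # arbitrary seed, just for reproducibility
n <- 100; p <- 3
X <- matrix(runif(n * p), n, p)                 # fixed design, as on the earlier slide
beta <- (p + 1):1
Y <- drop(cbind(1, X) %*% beta + rnorm(n, sd = 2))
Xmat <- cbind(1, X)                             # design matrix with an intercept column
beta_hat <- solve(t(Xmat) %*% Xmat, t(Xmat) %*% Y)         # (X'X)^{-1} X'y
cbind(by_hand = drop(beta_hat), via_lm = coef(lm(Y ~ X)))  # the two columns agree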
{
- "objectID": "schedule/slides/00-gradient-descent.html#check-vs.-glm",
- "href": "schedule/slides/00-gradient-descent.html#check-vs.-glm",
+ "objectID": "schedule/slides/01-lm-review.html#method-3-maximum-likelihood",
+ "href": "schedule/slides/01-lm-review.html#method-3-maximum-likelihood",
"title": "UBC Stat406 2023W",
- "section": "Check vs. glm()",
- "text": "Check vs. glm()\n\nsummary(glm(y ~ x - 1, family = \"binomial\"))\n\n\nCall:\nglm(formula = y ~ x - 1, family = \"binomial\")\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \nx 1.9174 0.4785 4.008 6.13e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 138.629 on 100 degrees of freedom\nResidual deviance: 32.335 on 99 degrees of freedom\nAIC: 34.335\n\nNumber of Fisher Scoring iterations: 7\n\n\n\n\nUBC Stat 406 - 2023"
+ "section": "Method 3: maximum likelihood",
+ "text": "Method 3: maximum likelihood\nMethod 2 didn’t use anything about the distribution of \\(\\epsilon\\).\nBut if we know that \\(\\epsilon\\) has a normal distribution, we can write down the joint distribution of \\(\\mathbf{y}=(y_1,\\ldots,y_n)^\\top\\):\n\\[\\begin{aligned}\nf_Y(\\mathbf{y} ; \\beta) &= \\prod_{i=1}^n f_{y_i ; \\beta}(y_i)\\\\\n &= \\prod_{i=1}^n \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left(-\\frac{1}{2\\sigma^2} (y_i-x_i^\\top \\beta)^2\\right)\\\\\n &= \\left( \\frac{1}{2\\pi\\sigma^2}\\right)^{n/2} \\exp\\left(-\\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i-x_i^\\top \\beta)^2\\right)\n\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/00-classification-losses.html#meta-lecture",
- "href": "schedule/slides/00-classification-losses.html#meta-lecture",
+ "objectID": "schedule/slides/01-lm-review.html#method-3-maximum-likelihood-1",
+ "href": "schedule/slides/01-lm-review.html#method-3-maximum-likelihood-1",
"title": "UBC Stat406 2023W",
- "section": "00 Evaluating classifiers",
- "text": "00 Evaluating classifiers\nStat 406\nDaniel J. McDonald\nLast modified – 16 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Method 3: maximum likelihood",
+ "text": "Method 3: maximum likelihood\n\\[\nf_Y(\\mathbf{y} ; \\beta) = \\left( \\frac{1}{2\\pi\\sigma^2}\\right)^{n/2} \\exp\\left(-\\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i-x_i^\\top \\beta)^2\\right)\n\\]\nIn probability courses, we think of \\(f_Y\\) as a function of \\(\\mathbf{y}\\) with \\(\\beta\\) fixed:\n\nIf we integrate over \\(\\mathbf{y}\\), it’s \\(1\\).\nIf we want the probability of \\((a,b)\\), we integrate from \\(a\\) to \\(b\\).\netc."
},
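As a quick numerical illustration of the “integrate from a to b” point for a single y_i (the mean 1, sd 2, and interval (0, 3) below are made up for the example):
mu <- 1; sigma <- 2                            # made-up mean x_i'beta and sd
f <- function(y) dnorm(y, mean = mu, sd = sigma)
integrate(f, lower = 0, upper = 3)$value       # P(0 < y_i < 3) by integrating the density
pnorm(3, mu, sigma) - pnorm(0, mu, sigma)      # same number
integrate(f, -Inf, Inf)$value                  # the density integrates to 1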
{
- "objectID": "schedule/slides/00-classification-losses.html#how-do-we-measure-accuracy",
- "href": "schedule/slides/00-classification-losses.html#how-do-we-measure-accuracy",
+ "objectID": "schedule/slides/01-lm-review.html#turn-it-around",
+ "href": "schedule/slides/01-lm-review.html#turn-it-around",
"title": "UBC Stat406 2023W",
- "section": "How do we measure accuracy?",
- "text": "How do we measure accuracy?\nSo far — 0-1 loss. If correct class, lose 0 else lose 1.\nAsymmetric classification loss — If correct class, lose 0 else lose something.\nFor example, consider facial recognition. Goal is “person OK”, “person has expired passport”, “person is a known terrorist”\n\nIf classify OK, but was terrorist, lose 1,000,000\nIf classify OK, but expired passport, lose 2\nIf classify terrorist, but was OK, lose 100\nIf classify terrorist, but was expired passport, lose 10\netc.\n\n\nResults in a 3x3 matrix of losses with 0 on the diagonal.\n\n\n [,1] [,2] [,3]\n[1,] 0 2 30\n[2,] 10 0 100\n[3,] 1000000 50000 0"
+ "section": "Turn it around…",
+ "text": "Turn it around…\n…instead, think of it as a function of \\(\\beta\\).\nWe call this “the likelihood” of beta: \\(\\mathcal{L}(\\beta)\\).\nGiven some data, we can evaluate the likelihood for any value of \\(\\beta\\) (assuming \\(\\sigma\\) is known).\nIt won’t integrate to 1 over \\(\\beta\\).\nBut it is “convex”,\nmeaning we can maximize it (the second derivative wrt \\(\\beta\\) is everywhere negative)."
},
{
- "objectID": "schedule/slides/00-classification-losses.html#deviance-loss",
- "href": "schedule/slides/00-classification-losses.html#deviance-loss",
+ "objectID": "schedule/slides/01-lm-review.html#so-lets-maximize",
+ "href": "schedule/slides/01-lm-review.html#so-lets-maximize",
"title": "UBC Stat406 2023W",
- "section": "Deviance loss",
- "text": "Deviance loss\nSometimes we output probabilities as well as class labels.\nFor example, logistic regression returns the probability that an observation is in class 1. \\(P(Y_i = 1 \\given x_i) = 1 / (1 + \\exp\\{-x'_i \\hat\\beta\\})\\)\nLDA and QDA produce probabilities as well. So do Neural Networks (typically)\n(Trees “don’t”, neither does KNN, though you could fake it)\n\n\n\nDeviance loss for 2-class classification is \\(-2\\textrm{loglikelihood}(y, \\hat{p}) = -2 (y_i x'_i\\hat{\\beta} - \\log (1-\\hat{p}))\\)\n\n(Technically, it’s the difference between this and the loss of the null model, but people play fast and loose)\n\nCould also use cross entropy or Gini index."
+ "section": "So let’s maximize",
+ "text": "So let’s maximize\nThe derivative of this thing is kind of ugly.\nBut if we’re trying to maximize over \\(\\beta\\), we can take an increasing transformation without changing anything.\nI choose \\(\\log_e\\).\n\\[\\begin{aligned}\n\\mathcal{L}(\\beta) &= \\left( \\frac{1}{2\\pi\\sigma^2}\\right)^{n/2} \\exp\\left(-\\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i-x_i^\\top \\beta)^2\\right)\\\\\n\\ell(\\beta) &=-\\frac{n}{2}\\log (2\\pi\\sigma^2) -\\frac{1}{2\\sigma^2} \\sum_{i=1}^n (y_i-x_i^\\top \\beta)^2\n\\end{aligned}\\]\nBut we can ignore constants, so this gives\n\\[\\widehat\\beta = \\argmax_\\beta -\\sum_{i=1}^n (y_i-x_i^\\top \\beta)^2\\]\nThe same as before!"
},
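A small check (simulated data again, with sigma treated as known) that maximizing the log-likelihood numerically lands on the least-squares coefficients; negloglik() is a throwaway helper written for this sketch.
set.seed(406)
n <- 100; p <- 3; sigma <- 2
X <- cbind(1, matrix(runif(n * p), n, p))      # design matrix including the intercept
beta <- (p + 1):1
Y <- drop(X %*% beta + rnorm(n, sd = sigma))
negloglik <- function(b) {                     # -ell(beta), ignoring the constant term
  sum((Y - X %*% b)^2) / (2 * sigma^2)
}
mle <- optim(rep(0, p + 1), negloglik, method = "BFGS")$par  # numerical maximizer of ell(beta)
ols <- coef(lm(Y ~ X - 1))                                   # least squares on the same design
round(cbind(mle, ols), 3)                                    # essentially identical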
{
- "objectID": "schedule/slides/00-classification-losses.html#calibration",
- "href": "schedule/slides/00-classification-losses.html#calibration",
+ "objectID": "schedule/slides/00-r-review.html#meta-lecture",
+ "href": "schedule/slides/00-r-review.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Calibration",
- "text": "Calibration\nSuppose we predict some probabilities for our data, how often do those events happen?\nIn principle, if we predict \\(\\hat{p}(x_i)=0.2\\) for a bunch of events observations \\(i\\), we’d like to see about 20% 1 and 80% 0. (In training set and test set)\nThe same goes for the other probabilities. If we say “20% chance of rain” it should rain 20% of such days.\nOf course, we didn’t predict exactly \\(\\hat{p}(x_i)=0.2\\) ever, so lets look at \\([.15, .25]\\).\n\nn <- 250\ndat <- tibble(\n x = seq(-5, 5, length.out = n),\n p = 1 / (1 + exp(-x)),\n y = rbinom(n, 1, p)\n)\nfit <- glm(y ~ x, family = binomial, data = dat)\ndat$phat <- predict(fit, type = \"response\") # predicted probabilities\ndat |>\n filter(phat > .15, phat < .25) |>\n summarize(target = .2, obs = mean(y))\n\n\n\n# A tibble: 1 × 2\n target obs\n <dbl> <dbl>\n1 0.2 0.222"
+ "section": "00 R, Rmarkdown, code, and {tidyverse}: A whirlwind tour",
+ "text": "00 R, Rmarkdown, code, and {tidyverse}: A whirlwind tour\nStat 406\nDaniel J. McDonald\nLast modified – 11 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
},
{
- "objectID": "schedule/slides/00-classification-losses.html#calibration-plot",
- "href": "schedule/slides/00-classification-losses.html#calibration-plot",
+ "objectID": "schedule/slides/00-r-review.html#tour-of-rstudio",
+ "href": "schedule/slides/00-r-review.html#tour-of-rstudio",
"title": "UBC Stat406 2023W",
- "section": "Calibration plot",
- "text": "Calibration plot\n\nbinary_calibration_plot <- function(y, phat, nbreaks = 10) {\n dat <- tibble(y = y, phat = phat) |>\n mutate(bins = cut_number(phat, n = nbreaks))\n midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)\n midpts <- midpts[-length(midpts)] + diff(midpts) / 2\n sum_dat <- dat |>\n group_by(bins) |>\n summarise(\n p = mean(y, na.rm = TRUE),\n se = sqrt(p * (1 - p) / n())\n ) |>\n arrange(p)\n sum_dat$x <- midpts\n\n ggplot(sum_dat, aes(x = x)) +\n geom_errorbar(aes(ymin = pmax(p - 1.96 * se, 0), ymax = pmin(p + 1.96 * se, 1))) +\n geom_point(aes(y = p), colour = blue) +\n geom_abline(slope = 1, intercept = 0, colour = orange) +\n ylab(\"observed frequency\") +\n xlab(\"average predicted probability\") +\n coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +\n geom_rug(data = dat, aes(x = phat), sides = \"b\")\n}"
+ "section": "Tour of Rstudio",
+ "text": "Tour of Rstudio\nThings to note\n\nConsole\nTerminal\nScripts, .Rmd, Knit\nFiles, Projects\nGetting help\nEnvironment, Git"
},
{
- "objectID": "schedule/slides/00-classification-losses.html#amazingly-well-calibrated",
- "href": "schedule/slides/00-classification-losses.html#amazingly-well-calibrated",
+ "objectID": "schedule/slides/00-r-review.html#simple-stuff",
+ "href": "schedule/slides/00-r-review.html#simple-stuff",
"title": "UBC Stat406 2023W",
- "section": "Amazingly well-calibrated",
- "text": "Amazingly well-calibrated\n\nbinary_calibration_plot(dat$y, dat$phat, 20L)"
+ "section": "Simple stuff",
+ "text": "Simple stuff\n\n\nVectors:\n\nx <- c(1, 3, 4)\nx[1]\n\n[1] 1\n\nx[-1]\n\n[1] 3 4\n\nrev(x)\n\n[1] 4 3 1\n\nc(x, x)\n\n[1] 1 3 4 1 3 4\n\n\n\n\n\nMatrices:\n\nx <- matrix(1:25, nrow = 5, ncol = 5)\nx[1,]\n\n[1] 1 6 11 16 21\n\nx[,-1]\n\n [,1] [,2] [,3] [,4]\n[1,] 6 11 16 21\n[2,] 7 12 17 22\n[3,] 8 13 18 23\n[4,] 9 14 19 24\n[5,] 10 15 20 25\n\nx[c(1,3), 2:3]\n\n [,1] [,2]\n[1,] 6 11\n[2,] 8 13"
},
{
- "objectID": "schedule/slides/00-classification-losses.html#less-well-calibrated",
- "href": "schedule/slides/00-classification-losses.html#less-well-calibrated",
+ "objectID": "schedule/slides/00-r-review.html#simple-stuff-1",
+ "href": "schedule/slides/00-r-review.html#simple-stuff-1",
"title": "UBC Stat406 2023W",
- "section": "Less well-calibrated",
- "text": "Less well-calibrated"
+ "section": "Simple stuff",
+ "text": "Simple stuff\n\n\nLists\n\n(l <- list(\n a = letters[1:2], \n b = 1:4, \n c = list(a = 1)))\n\n$a\n[1] \"a\" \"b\"\n\n$b\n[1] 1 2 3 4\n\n$c\n$c$a\n[1] 1\n\nl$a\n\n[1] \"a\" \"b\"\n\nl$c$a\n\n[1] 1\n\nl[\"b\"] # compare to l[[\"b\"]] == l$b\n\n$b\n[1] 1 2 3 4\n\n\n\n\nData frames\n\n(dat <- data.frame(\n z = 1:5, \n b = 6:10, \n c = letters[1:5]))\n\n z b c\n1 1 6 a\n2 2 7 b\n3 3 8 c\n4 4 9 d\n5 5 10 e\n\nclass(dat)\n\n[1] \"data.frame\"\n\ndat$b\n\n[1] 6 7 8 9 10\n\ndat[1,]\n\n z b c\n1 1 6 a\n\n\n\n\nData frames are sort-of lists and sort-of matrices"
},
{
- "objectID": "schedule/slides/00-classification-losses.html#true-positive-false-negative-sensitivity-specificity",
- "href": "schedule/slides/00-classification-losses.html#true-positive-false-negative-sensitivity-specificity",
+ "objectID": "schedule/slides/00-r-review.html#tibbles",
+ "href": "schedule/slides/00-r-review.html#tibbles",
"title": "UBC Stat406 2023W",
- "section": "True positive, false negative, sensitivity, specificity",
- "text": "True positive, false negative, sensitivity, specificity\n\nTrue positive rate\n\n# correct predict positive / # actual positive (1 - FNR)\n\nFalse negative rate\n\n# incorrect predict negative / # actual positive (1 - TPR), Type II Error\n\nTrue negative rate\n\n# correct predict negative / # actual negative\n\nFalse positive rate\n\n# incorrect predict positive / # actual negative (1 - TNR), Type I Error\n\nSensitivity\n\nTPR, 1 - Type II error\n\nSpecificity\n\nTNR, 1 - Type I error"
+ "section": "Tibbles",
+ "text": "Tibbles\nThese are {tidyverse} data frames\n\n(dat2 <- tibble(z = 1:5, b = z + 5, c = letters[z]))\n\n# A tibble: 5 × 3\n z b c \n <int> <dbl> <chr>\n1 1 6 a \n2 2 7 b \n3 3 8 c \n4 4 9 d \n5 5 10 e \n\nclass(dat2)\n\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n\n\nWe’ll return to classes in a moment. A tbl_df is a “subclass” of data.frame.\nAnything that data.frame can do, tbl_df can do (better).\nFor instance, the printing is more informative.\nAlso, you can construct one by referencing previously constructed columns."
},
{
- "objectID": "schedule/slides/00-classification-losses.html#roc-and-thresholds",
- "href": "schedule/slides/00-classification-losses.html#roc-and-thresholds",
+ "objectID": "schedule/slides/00-r-review.html#understanding-signatures",
+ "href": "schedule/slides/00-r-review.html#understanding-signatures",
"title": "UBC Stat406 2023W",
- "section": "ROC and thresholds",
- "text": "ROC and thresholds\n\nROC (Receiver Operating Characteristic) Curve\n\nTPR (sensitivity) vs. FPR (1 - specificity)\n\nAUC (Area under the curve)\n\nIntegral of ROC. Closer to 1 is better.\n\n\nSo far, we’ve been thresholding at 0.5, though you shouldn’t always do that.\nWith unbalanced data (say 10% 0 and 90% 1), if you care equally about predicting both classes, you might want to choose a different cutoff (like in LDA).\nTo make the ROC we look at our errors as we vary the cutoff"
+ "section": "Understanding signatures",
+ "text": "Understanding signatures\n\n\nCode\nsig <- sig::sig\n\n\n\nsig(lm)\n\nfn <- function(formula, data, subset, weights, na.action, method = \"qr\", model\n = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts =\n NULL, offset, ...)\n\nsig(`+`)\n\nfn <- function(e1, e2)\n\nsig(dplyr::filter)\n\nfn <- function(.data, ..., .by = NULL, .preserve = FALSE)\n\nsig(stats::filter)\n\nfn <- function(x, filter, method = c(\"convolution\", \"recursive\"), sides = 2,\n circular = FALSE, init = NULL)\n\nsig(rnorm)\n\nfn <- function(n, mean = 0, sd = 1)"
},
{
- "objectID": "schedule/slides/00-classification-losses.html#roc-curve",
- "href": "schedule/slides/00-classification-losses.html#roc-curve",
+ "objectID": "schedule/slides/00-r-review.html#these-are-all-the-same",
+ "href": "schedule/slides/00-r-review.html#these-are-all-the-same",
"title": "UBC Stat406 2023W",
- "section": "ROC curve",
- "text": "ROC curve\n\n\nroc <- function(prediction, y) {\n op <- order(prediction, decreasing = TRUE)\n preds <- prediction[op]\n y <- y[op]\n noty <- 1 - y\n if (any(duplicated(preds))) {\n y <- rev(tapply(y, preds, sum))\n noty <- rev(tapply(noty, preds, sum))\n }\n tibble(\n FPR = cumsum(noty) / sum(noty),\n TPR = cumsum(y) / sum(y)\n )\n}\n\nggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +\n geom_step(colour = blue, size = 2) +\n geom_abline(slope = 1, intercept = 0)"
+ "section": "These are all the same",
+ "text": "These are all the same\n\nset.seed(12345)\nrnorm(3)\n\n[1] 0.5855288 0.7094660 -0.1093033\n\nset.seed(12345)\nrnorm(n = 3, mean = 0)\n\n[1] 0.5855288 0.7094660 -0.1093033\n\nset.seed(12345)\nrnorm(3, 0, 1)\n\n[1] 0.5855288 0.7094660 -0.1093033\n\nset.seed(12345)\nrnorm(sd = 1, n = 3, mean = 0)\n\n[1] 0.5855288 0.7094660 -0.1093033\n\n\n\nFunctions can have default values.\nYou may, but don’t have to, name the arguments\nIf you name them, you can pass them out of order (but you shouldn’t)."
},
{
- "objectID": "schedule/slides/00-classification-losses.html#other-stuff",
- "href": "schedule/slides/00-classification-losses.html#other-stuff",
+ "objectID": "schedule/slides/00-r-review.html#write-lots-of-functions.-i-cant-emphasize-this-enough.",
+ "href": "schedule/slides/00-r-review.html#write-lots-of-functions.-i-cant-emphasize-this-enough.",
"title": "UBC Stat406 2023W",
- "section": "Other stuff",
- "text": "Other stuff\n\n\nSource: worth exploring Wikipedia\n\n\n\nUBC Stat 406 - 2023"
+ "section": "Write lots of functions. I can’t emphasize this enough.",
+ "text": "Write lots of functions. I can’t emphasize this enough.\n\n\n\nf <- function(arg1, arg2, arg3 = 12, ...) {\n stuff <- arg1 * arg3\n stuff2 <- stuff + arg2\n plot(arg1, stuff2, ...)\n return(stuff2)\n}\nx <- rnorm(100)\n\n\n\n\ny1 <- f(x, 3, 15, col = 4, pch = 19)\n\n\n\n\n\n\n\nstr(y1)\n\n num [1:100] -3.8 12.09 -24.27 12.45 -1.14 ..."
},
{
- "objectID": "course-setup.html",
- "href": "course-setup.html",
- "title": "Guide for setting up the course infrastructure",
+ "objectID": "schedule/slides/00-r-review.html#outputs-vs.-side-effects",
+ "href": "schedule/slides/00-r-review.html#outputs-vs.-side-effects",
+ "title": "UBC Stat406 2023W",
+ "section": "Outputs vs. Side effects",
+ "text": "Outputs vs. Side effects\n\n\n\nSide effects are things a function does, outputs can be assigned to variables\nA good example is the hist function\nYou have probably only seen the side effect which is to plot the histogram\n\n\nmy_histogram <- hist(rnorm(1000))\n\n\n\n\n\n\n\n\n\n\n\nstr(my_histogram)\n\nList of 6\n $ breaks : num [1:14] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...\n $ counts : int [1:13] 4 21 41 89 142 200 193 170 74 38 ...\n $ density : num [1:13] 0.008 0.042 0.082 0.178 0.284 0.4 0.386 0.34 0.148 0.076 ...\n $ mids : num [1:13] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...\n $ xname : chr \"rnorm(1000)\"\n $ equidist: logi TRUE\n - attr(*, \"class\")= chr \"histogram\"\n\nclass(my_histogram)\n\n[1] \"histogram\""
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#when-writing-functions-program-defensively-ensure-behaviour",
+ "href": "schedule/slides/00-r-review.html#when-writing-functions-program-defensively-ensure-behaviour",
+ "title": "UBC Stat406 2023W",
+ "section": "When writing functions, program defensively, ensure behaviour",
+ "text": "When writing functions, program defensively, ensure behaviour\n\n\n\nincrementer <- function(x, inc_by = 1) {\n x + 1\n}\n \nincrementer(2)\n\n[1] 3\n\nincrementer(1:4)\n\n[1] 2 3 4 5\n\nincrementer(\"a\")\n\nError in x + 1: non-numeric argument to binary operator\n\n\n\nincrementer <- function(x, inc_by = 1) {\n stopifnot(is.numeric(x))\n return(x + 1)\n}\nincrementer(\"a\")\n\nError in incrementer(\"a\"): is.numeric(x) is not TRUE\n\n\n\n\n\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + 1\n}\nincrementer(\"a\")\n\nError in incrementer(\"a\"): `x` must be numeric\n\nincrementer(2, -3) ## oops!\n\n[1] 3\n\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + inc_by\n}\nincrementer(2, -3)\n\n[1] -1"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#how-to-keep-track",
+ "href": "schedule/slides/00-r-review.html#how-to-keep-track",
+ "title": "UBC Stat406 2023W",
+ "section": "How to keep track",
+ "text": "How to keep track\n\n\n\nlibrary(testthat)\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n if (!is.numeric(inc_by)) {\n stop(\"`inc_by` must be numeric\")\n }\n x + inc_by\n}\nexpect_error(incrementer(\"a\"))\nexpect_equal(incrementer(1:3), 2:4)\nexpect_equal(incrementer(2, -3), -1)\nexpect_error(incrementer(1, \"b\"))\nexpect_identical(incrementer(1:3), 2:4)\n\nError: incrementer(1:3) not identical to 2:4.\nObjects equal but not identical\n\n\n\n\n\nis.integer(2:4)\n\n[1] TRUE\n\nis.integer(incrementer(1:3))\n\n[1] FALSE\n\nexpect_identical(incrementer(1:3, 1L), 2:4)\n\n\n\n\n\n\n\n\n\n\nImportant\n\n\nIf you copy something, write a function.\nValidate your arguments.\nTo ensure proper functionality, write tests to check if inputs result in predicted outputs."
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#classes",
+ "href": "schedule/slides/00-r-review.html#classes",
+ "title": "UBC Stat406 2023W",
+ "section": "Classes",
+ "text": "Classes\n\n\nWe saw some of these earlier:\n\ntib <- tibble(\n x1 = rnorm(100), \n x2 = rnorm(100), \n y = x1 + 2 * x2 + rnorm(100)\n)\nmdl <- lm(y ~ ., data = tib )\nclass(tib)\n\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n\nclass(mdl)\n\n[1] \"lm\"\n\n\nThe class allows for the use of “methods”\n\nprint(mdl)\n\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n -0.1742 1.0454 2.0470 \n\n\n\n\n\nR “knows what to do” when you print() an object of class \"lm\".\nprint() is called a “generic” function.\nYou can create “methods” that get dispatched.\nFor any generic, R looks for a method for the class.\nIf available, it calls that function."
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#viewing-the-dispatch-chain",
+ "href": "schedule/slides/00-r-review.html#viewing-the-dispatch-chain",
+ "title": "UBC Stat406 2023W",
+ "section": "Viewing the dispatch chain",
+ "text": "Viewing the dispatch chain\n\nsloop::s3_dispatch(print(incrementer))\n\n=> print.function\n * print.default\n\nsloop::s3_dispatch(print(tib))\n\n print.tbl_df\n=> print.tbl\n * print.data.frame\n * print.default\n\nsloop::s3_dispatch(print(mdl))\n\n=> print.lm\n * print.default"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#r-geeky-but-important",
+ "href": "schedule/slides/00-r-review.html#r-geeky-but-important",
+ "title": "UBC Stat406 2023W",
+ "section": "R-Geeky But Important",
+ "text": "R-Geeky But Important\nThere are lots of generic functions in R\nCommon ones are print(), summary(), and plot().\nAlso, lots of important statistical modelling concepts: residuals() coef()\n(In python, these work the opposite way: obj.residuals. The dot after the object accesses methods defined for that type of object. But the dispatch behaviour is less robust.)\n\nThe convention is that the specialized function is named method.class(), e.g., summary.lm().\nIf no specialized function is defined, R will try to use method.default().\n\nFor this reason, R programmers try to avoid . in names of functions or objects."
+ },
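A toy example of the method.class() convention (the "course" class and its fields are invented for illustration):
new_course <- function(name, enrolment) {
  structure(list(name = name, enrolment = enrolment), class = "course")
}
print.course <- function(x, ...) {         # a method for the print() generic
  cat(x$name, "with", x$enrolment, "students\n")
  invisible(x)
}
stat406 <- new_course("Stat 406", 150)     # 150 is a made-up number
print(stat406)                             # dispatches to print.course()
print(unclass(stat406))                    # drop the class: falls back to default list printing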
+ {
+ "objectID": "schedule/slides/00-r-review.html#wherefore-methods",
+ "href": "schedule/slides/00-r-review.html#wherefore-methods",
+ "title": "UBC Stat406 2023W",
+ "section": "Wherefore methods?",
+ "text": "Wherefore methods?\n\nThe advantage is that you don’t have to learn a totally new syntax to grab residuals or plot things\nYou just use residuals(mdl) whether mdl has class lm could have been done two centuries ago, or a Batrachian Emphasis Machine which won’t be invented for another five years.\nThe one draw-back is the help pages for the generic methods tend to be pretty vague\nCompare ?summary with ?summary.lm."
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#different-environments",
+ "href": "schedule/slides/00-r-review.html#different-environments",
+ "title": "UBC Stat406 2023W",
+ "section": "Different environments",
+ "text": "Different environments\n\nThese are often tricky, but are very common.\nMost programming languages have this concept in one way or another.\nIn R code run in the Console produces objects in the “Global environment”\nYou can see what you create in the “Environment” tab.\nBut there’s lots of other stuff.\nMany packages are automatically loaded at startup, so you have access to the functions and data inside\n\nFor example mean(), lm(), plot(), iris (technically iris is lazy-loaded, meaning it’s not in memory until you call it, but it is available)"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#section",
+ "href": "schedule/slides/00-r-review.html#section",
+ "title": "UBC Stat406 2023W",
+ "section": "",
+ "text": "Other packages require you to load them with library(pkg) before their functions are available.\nBut, you can call those functions by prefixing the package name ggplot2::ggplot().\nYou can also access functions that the package developer didn’t “export” for use with ::: like dplyr:::as_across_fn_call()\n\n\nThat is all about accessing “objects in package environments”"
+ },
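A tiny sketch of the :: idea; both calls below are real exported functions: stats::filter() is written with a prefix so it cannot be shadowed later by dplyr::filter(), and tools::file_ext() comes from a package that ships with R but isn’t attached at startup.
x <- c(1, 5, 2, 8, 3, 9)
stats::filter(x, rep(1 / 3, 3))     # 3-term moving average, unambiguous even if {dplyr} is loaded
tools::file_ext("assignment1.Rmd")  # "Rmd"; no library(tools) call needed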
+ {
+ "objectID": "schedule/slides/00-r-review.html#other-issues-with-environments",
+ "href": "schedule/slides/00-r-review.html#other-issues-with-environments",
+ "title": "UBC Stat406 2023W",
+ "section": "Other issues with environments",
+ "text": "Other issues with environments\nAs one might expect, functions create an environment inside the function.\n\nz <- 1\nfun <- function(x) {\n z <- x\n print(z)\n invisible(z)\n}\nfun(14)\n\n[1] 14\n\n\nNon-trivial cases are data-masking environments.\n\ntib <- tibble(x1 = rnorm(100), x2 = rnorm(100), y = x1 + 2 * x2)\nmdl <- lm(y ~ x2, data = tib)\nx2\n\nError in eval(expr, envir, enclos): object 'x2' not found\n\n\n\nlm() looks “inside” the tib to find y and x2\nThe data variables are added to the lm() environment"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#other-issues-with-environments-1",
+ "href": "schedule/slides/00-r-review.html#other-issues-with-environments-1",
+ "title": "UBC Stat406 2023W",
+ "section": "Other issues with environments",
+ "text": "Other issues with environments\nWhen Knit, .Rmd files run in their OWN environment.\nThey are run from top to bottom, with code chunks depending on previous\nThis makes them reproducible.\nJupyter notebooks don’t do this. 😱\nObjects in your local environment are not available in the .Rmd\nObjects in the .Rmd are not available locally.\n\n\n\n\n\n\nTip\n\n\nThe most frequent error I see is:\n\nrunning chunks individually, 1-by-1, and it works\nKnitting, and it fails\n\nThe reason is almost always that the chunks refer to objects in the Environment that don’t exist in the .Rmd"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#section-1",
+ "href": "schedule/slides/00-r-review.html#section-1",
+ "title": "UBC Stat406 2023W",
+ "section": "",
+ "text": "This error also happens because:\n\nlibrary() calls were made globally but not in the .Rmd\n\nso the packages aren’t loaded\n\npaths to data or other objects are not relative to the .Rmd in your file system\n\nthey must be\n\nCarefully keeping Labs / Assignments in their current location will help to avoid some of these."
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#how-to-fix-code",
+ "href": "schedule/slides/00-r-review.html#how-to-fix-code",
+ "title": "UBC Stat406 2023W",
+ "section": "How to fix code",
+ "text": "How to fix code\n\nIf you’re using a function in a package, start with ?function to see the help\n\nMake sure you’re calling the function correctly.\nTry running the examples.\npaste the error into Google (if you share the error on Slack, I often do this first)\nGo to the package website if it exists, and browse around\n\nIf your .Rmd won’t Knit\n\nDid you make the mistake on the last slide?\nDid it Knit before? Then the bug is in whatever you added.\nDid you never Knit it? Why not?\nCall rstudioapi::restartSession(), then run the Chunks 1-by-1"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#section-2",
+ "href": "schedule/slides/00-r-review.html#section-2",
+ "title": "UBC Stat406 2023W",
+ "section": "",
+ "text": "Adding browser()\n\nOnly useful with your own functions.\nOpen the script with the function, and add browser() to the code somewhere\nThen call your function.\nThe execution will Stop where you added browser() and you’ll have access to the local environment to play around"
+ },
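A minimal sketch of that workflow; buggy_summary() is a made-up function just to show where the pause happens:
buggy_summary <- function(x) {
  z <- x[x > 0]
  browser()          # execution stops here; inspect z, then type c to continue or Q to quit
  mean(z) / sd(z)
}
# buggy_summary(rnorm(10))  # run this interactively to drop into the browser at that line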
+ {
+ "objectID": "schedule/slides/00-r-review.html#reproducible-examples",
+ "href": "schedule/slides/00-r-review.html#reproducible-examples",
+ "title": "UBC Stat406 2023W",
+ "section": "Reproducible examples",
+ "text": "Reproducible examples\n\n\n\n\n\n\nQuestion I get on Slack that I hate:\n\n\n“I ran the code like you had on Slide 39, but it didn’t work.”\n\n\n\n\nIf you want to ask me why the code doesn’t work, you need to show me what’s wrong.\n\n\n\n\n\n\n\nDon’t just paste a screenshot!\n\n\nUnless you get lucky, I won’t be able to figure it out from that. And we’ll both get frustrated.\n\n\n\nWhat you need is a Reproducible Example or reprex.\n\nThis is a small chunk of code that\n\nruns in it’s own environment\nand produces the error."
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#reproducible-examples-how-it-works",
+ "href": "schedule/slides/00-r-review.html#reproducible-examples-how-it-works",
+ "title": "UBC Stat406 2023W",
+ "section": "Reproducible examples, How it works",
+ "text": "Reproducible examples, How it works\n\nOpen a new .R script.\nPaste your buggy code in the file (no need to save)\nEdit your code to make sure it’s “enough to produce the error” and nothing more. (By rerunning the code a few times.)\nCopy your code.\nCall reprex::reprex(venue = \"r\") from the console. This will run your code in a new environment and show the result in the Viewer tab. Does it create the error you expect?\nIf it creates other errors, that may be the problem. You may fix the bug on your own!\nIf it doesn’t have errors, then your global environment is Farblunget.\nThe Output is now on your clipboard. Go to Slack and paste it in a message. Then press Cmd+Shift+Enter (on Mac) or Ctrl+Shift+Enter (Windows/Linux). Under Type, select R.\nSend the message, perhaps with more description and an SOS emoji.\n\n\n\n\n\n\n\nNote\n\n\nBecause Reprex runs in it’s own environment, it doesn’t have access to any of the libraries you loaded or the stuff in your global environment. You’ll have to load these things in the script."
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#tidyverse-is-huge",
+ "href": "schedule/slides/00-r-review.html#tidyverse-is-huge",
+ "title": "UBC Stat406 2023W",
+ "section": "{tidyverse} is huge",
+ "text": "{tidyverse} is huge\nCore tidyverse is nearly 30 different R packages, but we’re going to just talk about a few of them.\nFalls roughly into a few categories:\n\nConvenience functions: {magrittr} and many many others.\nData processing: {dplyr} and many others.\nGraphing: {ggplot2} and some others like {scales}.\nUtilities\n\n\n\nWe’re going to talk quickly about some of it, but ignore much of 2.\nThere’s a lot that’s great about these packages, especially ease of data processing.\nBut it doesn’t always jive with base R (it’s almost a separate proglang at this point)."
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#piping",
+ "href": "schedule/slides/00-r-review.html#piping",
+ "title": "UBC Stat406 2023W",
+ "section": "Piping",
+ "text": "Piping\nThis was introduced by {magrittr} as %>%,\nbut is now in base R (>=4.1.0) as |>.\nNote: there are other pipes in {magrittr} (e.g. %$% and %T%) but I’ve never used them.\nI’ve used the old version for so long, that it’s hard for me to adopt the new one.\nThe point of the pipe is to logically sequence nested operations"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#example",
+ "href": "schedule/slides/00-r-review.html#example",
+ "title": "UBC Stat406 2023W",
+ "section": "Example",
+ "text": "Example\n\n\n\nmse1 <- print(\n sum(\n residuals(\n lm(y~., data = mutate(\n tib, \n x3 = x1^2,\n x4 = log(x2 + abs(min(x2)) + 1)\n )\n )\n )^2\n )\n)\n\n[1] 6.469568e-29\n\n\n\n\nmse2 <- tib |>\n mutate(\n x3 = x1^2, \n x4 = log(x2 + abs(min(x2)) + 1)\n ) %>% # base pipe only goes to first arg\n lm(y ~ ., data = .) |> # note the use of `.`\n residuals() |>\n magrittr::raise_to_power(2) |> # same as `^`(2)\n sum() |>\n print()\n\n[1] 6.469568e-29"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#section-4",
+ "href": "schedule/slides/00-r-review.html#section-4",
+ "title": "UBC Stat406 2023W",
+ "section": "",
+ "text": "It may seem like we should push this all the way\n\ntib |>\n mutate(\n x3 = x1^2, \n x4 = log(x2 + abs(min(x2)) + 1)\n ) %>% # base pipe only goes to first arg\n lm(y ~ ., data = .) |> # note the use of `.`\n residuals() |>\n magrittr::raise_to_power(2) |> # same as `^`(2)\n sum() ->\n mse3\n\nThis works, but it’s really annoying."
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#a-new-one",
+ "href": "schedule/slides/00-r-review.html#a-new-one",
+ "title": "UBC Stat406 2023W",
+ "section": "A new one…",
+ "text": "A new one…\nJust last week, I learned\n\nlibrary(magrittr)\ntib <- tibble(x = 1:5, z = 6:10)\ntib <- tib |> mutate(b = x + z)\ntib\n\n# A tibble: 5 × 3\n x z b\n <int> <int> <int>\n1 1 6 7\n2 2 7 9\n3 3 8 11\n4 4 9 13\n5 5 10 15\n\n# start over\ntib <- tibble(x = 1:5, z = 6:10)\ntib %<>% mutate(b = x + z)\ntib\n\n# A tibble: 5 × 3\n x z b\n <int> <int> <int>\n1 1 6 7\n2 2 7 9\n3 3 8 11\n4 4 9 13\n5 5 10 15"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#data-processing-in-dplyr",
+ "href": "schedule/slides/00-r-review.html#data-processing-in-dplyr",
+ "title": "UBC Stat406 2023W",
+ "section": "Data processing in {dplyr}",
+ "text": "Data processing in {dplyr}\nThis package has all sorts of things. And it interacts with {tibble} generally.\nThe basic idea is “tibble in, tibble out”.\nSatisfies data masking which means you can refer to columns by name or use helpers like ends_with(\"_rate\")\nMajorly useful operations:\n\nselect() (chooses columns to keep)\nmutate() (showed this already)\ngroup_by()\npivot_longer() and pivot_wider()\nleft_join() and full_join()\nsummarise()\n\n\n\n\n\n\n\nNote\n\n\nfilter() and select() are functions in Base R.\nSometimes you get 🐞 because it called the wrong version.\nTo be sure, prefix it like dplyr::select()."
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#a-useful-data-frame",
+ "href": "schedule/slides/00-r-review.html#a-useful-data-frame",
+ "title": "UBC Stat406 2023W",
+ "section": "A useful data frame",
+ "text": "A useful data frame\n\nlibrary(epidatr)\ncovid <- covidcast(\n source = \"jhu-csse\",\n signals = \"confirmed_7dav_incidence_prop,deaths_7dav_incidence_prop\",\n time_type = \"day\",\n geo_type = \"state\",\n time_values = epirange(20220801, 20220821),\n geo_values = \"ca,wa\") |>\n fetch() |>\n select(geo_value, time_value, signal, value)\n\ncovid\n\n# A tibble: 84 × 4\n geo_value time_value signal value\n <chr> <date> <chr> <dbl>\n 1 ca 2022-08-01 confirmed_7dav_incidence_prop 45.4\n 2 wa 2022-08-01 confirmed_7dav_incidence_prop 27.7\n 3 ca 2022-08-02 confirmed_7dav_incidence_prop 44.9\n 4 wa 2022-08-02 confirmed_7dav_incidence_prop 27.7\n 5 ca 2022-08-03 confirmed_7dav_incidence_prop 44.5\n 6 wa 2022-08-03 confirmed_7dav_incidence_prop 26.6\n 7 ca 2022-08-04 confirmed_7dav_incidence_prop 42.3\n 8 wa 2022-08-04 confirmed_7dav_incidence_prop 26.6\n 9 ca 2022-08-05 confirmed_7dav_incidence_prop 40.7\n10 wa 2022-08-05 confirmed_7dav_incidence_prop 34.6\n# ℹ 74 more rows"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#examples",
+ "href": "schedule/slides/00-r-review.html#examples",
+ "title": "UBC Stat406 2023W",
+ "section": "Examples",
+ "text": "Examples\nRename the signal to something short.\n\ncovid <- covid |> \n mutate(signal = case_when(\n str_starts(signal, \"confirmed\") ~ \"case_rate\", \n TRUE ~ \"death_rate\"\n ))\n\nSort by time_value then geo_value\n\ncovid <- covid |> arrange(time_value, geo_value)\n\nCalculate grouped medians\n\ncovid |> \n group_by(geo_value, signal) |>\n summarise(med = median(value), .groups = \"drop\")\n\n# A tibble: 4 × 3\n geo_value signal med\n <chr> <chr> <dbl>\n1 ca case_rate 33.2 \n2 ca death_rate 0.112\n3 wa case_rate 23.2 \n4 wa death_rate 0.178"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#examples-1",
+ "href": "schedule/slides/00-r-review.html#examples-1",
+ "title": "UBC Stat406 2023W",
+ "section": "Examples",
+ "text": "Examples\nSplit the data into two tibbles by signal\n\ncases <- covid |> \n filter(signal == \"case_rate\") |>\n rename(case_rate = value) |> select(-signal)\ndeaths <- covid |> \n filter(signal == \"death_rate\") |>\n rename(death_rate = value) |> select(-signal)\n\nJoin them together\n\njoined <- full_join(cases, deaths, by = c(\"geo_value\", \"time_value\"))\n\nDo the same thing by pivoting\n\ncovid |> pivot_wider(names_from = signal, values_from = value)\n\n# A tibble: 42 × 4\n geo_value time_value case_rate death_rate\n <chr> <date> <dbl> <dbl>\n 1 ca 2022-08-01 45.4 0.105\n 2 wa 2022-08-01 27.7 0.169\n 3 ca 2022-08-02 44.9 0.106\n 4 wa 2022-08-02 27.7 0.169\n 5 ca 2022-08-03 44.5 0.107\n 6 wa 2022-08-03 26.6 0.173\n 7 ca 2022-08-04 42.3 0.112\n 8 wa 2022-08-04 26.6 0.173\n 9 ca 2022-08-05 40.7 0.116\n10 wa 2022-08-05 34.6 0.225\n# ℹ 32 more rows"
+ },
+ {
+ "objectID": "schedule/slides/00-r-review.html#plotting-with-ggplot2",
+ "href": "schedule/slides/00-r-review.html#plotting-with-ggplot2",
+ "title": "UBC Stat406 2023W",
+ "section": "Plotting with {ggplot2}",
+ "text": "Plotting with {ggplot2}\n\nEverything you can do with ggplot(), you can do with plot(). But the defaults are much prettier.\nIt’s also much easier to adjust by aesthetics / panels by factors.\nIt also uses “data masking”: data goes into ggplot(data = mydata), then the columns are available to the rest.\nIt (sort of) pipes, but by adding layers with +\nIt strongly prefers “long” data frames over “wide” data frames.\n\n\nI’ll give a very fast overview of some confusing bits."
+ },
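A tiny sketch of those points, assuming the long-format covid tibble built on the {epidatr} slides above is still in the environment:
library(ggplot2)
ggplot(covid, aes(time_value, value, colour = geo_value)) +  # data masking: bare column names
  geom_line() +                                  # layers are added with +
  facet_wrap(~signal, scales = "free_y") +       # one panel per signal, easy because the data are long
  labs(x = "date", y = "value", colour = "state")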
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#meta-lecture",
+ "href": "schedule/slides/00-intro-to-class.html#meta-lecture",
+ "title": "UBC Stat406 2023W",
+ "section": "00 Intro to class",
+ "text": "00 Intro to class\nStat 406\nDaniel J. McDonald\nLast modified – 17 August 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\]"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#about-me",
+ "href": "schedule/slides/00-intro-to-class.html#about-me",
+ "title": "UBC Stat406 2023W",
+ "section": "About me",
+ "text": "About me\n\n\n\nDaniel J. McDonald\ndaniel@stat.ubc.ca\nhttp://dajmcdon.github.io/\nAssociate Professor, Department of Statistics"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#philosophy-of-the-class",
+ "href": "schedule/slides/00-intro-to-class.html#philosophy-of-the-class",
+ "title": "UBC Stat406 2023W",
+ "section": "Philosophy of the class",
+ "text": "Philosophy of the class\nI and the TAs are here to help you learn. Ask questions.\nWe encourage engagement, curiosity and generosity\nWe favour steady work through the Term (vs. sleeping until finals)\n\nThe assessments attempt to reflect this ethos."
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#more-philosophy",
+ "href": "schedule/slides/00-intro-to-class.html#more-philosophy",
+ "title": "UBC Stat406 2023W",
+ "section": "More philosophy",
+ "text": "More philosophy\nWhen the term ends, I want\n\nYou to be better at coding.\nYou to have an understanding of the variety of methods available to do prediction and data analysis.\nYou to articulate their strengths and weaknesses.\nYou to be able to choose between different methods using your intuition and the data.\n\n\nI do not want\n\nYou to be under undo stress\nYou to feel the need to cheat, plagiarize, or drop the course\nYou to feel treated unfairly."
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#section",
+ "href": "schedule/slides/00-intro-to-class.html#section",
+ "title": "UBC Stat406 2023W",
+ "section": "",
+ "text": "I promise\n\nTo grade/mark fairly. Good faith effort will be rewarded\nTo be flexible. This semester (like the last 4) is different for everyone.\nTo understand and adapt to issues.\n\n\nI do not promise that you will all get the grade you want."
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#on-covid",
+ "href": "schedule/slides/00-intro-to-class.html#on-covid",
+ "title": "UBC Stat406 2023W",
+ "section": "On COVID",
+ "text": "On COVID\n\n\n\nI work on COVID a lot.\nStatistics is hugely important.\n\nPolicies (TL; DR)\n\nI encourage you to wear a mask\nDo NOT come to class if you are possibly sick\nBe kind and considerate to others\nThe Marking scheme is flexible enough to allow some missed classes"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#section-1",
+ "href": "schedule/slides/00-intro-to-class.html#section-1",
+ "title": "UBC Stat406 2023W",
+ "section": "",
+ "text": "We’ll talk about lots of ML models\nBut our focus is on how to “understand” everything in this diagram.\nHow do we interpret? Evaluate? Choose a model?\nWhat are the implications / assumptions implied by our choices?\nDeep understanding of statistics helps with intuition."
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#predictive-models",
+ "href": "schedule/slides/00-intro-to-class.html#predictive-models",
+ "title": "UBC Stat406 2023W",
+ "section": "Predictive models",
+ "text": "Predictive models\n\n1. Preprocessing\ncentering / scaling / factors-to-dummies / basis expansion / missing values / dimension reduction / discretization / transformations\n2. Model fitting\nWhich box do you use?\n3. Prediction\nRepeat all the preprocessing on new data. But be careful.\n4. Postprocessing, interpretation, and evaluation"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#section-5",
+ "href": "schedule/slides/00-intro-to-class.html#section-5",
+ "title": "UBC Stat406 2023W",
+ "section": "",
+ "text": "Source: https://vas3k.com/blog/machine_learning/"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#modules",
+ "href": "schedule/slides/00-intro-to-class.html#modules",
+ "title": "UBC Stat406 2023W",
+ "section": "6 modules",
+ "text": "6 modules\n\n\n\nReview (today and next week)\nModel accuracy and selection\nRegularization, smoothing, trees\nClassifiers\nModern techniques (classification and regression)\nUnsupervised learning\n\n\n\nEach module is approximately 2 weeks long\nEach module is based on a collection of readings and lectures\nEach module (except the review) has a homework assignment"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#assessments",
+ "href": "schedule/slides/00-intro-to-class.html#assessments",
+ "title": "UBC Stat406 2023W",
+ "section": "Assessments",
+ "text": "Assessments\nEffort-based\nTotal across three components: 65 points, any way you want\n\nLabs, up to 20 points (2 each)\nAssignments, up to 50 points (10 each)\nClickers, up to 10 points\n\n\nKnowledge-based\nFinal Exam, 35 points"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#why-this-scheme",
+ "href": "schedule/slides/00-intro-to-class.html#why-this-scheme",
+ "title": "UBC Stat406 2023W",
+ "section": "Why this scheme?",
+ "text": "Why this scheme?\n\nYou stay on top of the material\nYou come to class and participate\nYou gain coding practice in the labs\nYou work hard on the assignments\n\n\n\n\n\n\n\nMost of this is Effort Based\n\n\nwork hard, guarantee yourself 65%"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#time-expectations-per-week",
+ "href": "schedule/slides/00-intro-to-class.html#time-expectations-per-week",
+ "title": "UBC Stat406 2023W",
+ "section": "Time expectations per week:",
+ "text": "Time expectations per week:\n\nComing to class – 3 hours\nReading the book – 1 hour\nLabs – 1 hour\nHomework – 4 hours\nStudy / thinking / playing – 1 hour\n\n\nShow the course website https://ubc-stat.github.io/stat-406/"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#labs-assignments",
+ "href": "schedule/slides/00-intro-to-class.html#labs-assignments",
+ "title": "UBC Stat406 2023W",
+ "section": "Labs / Assignments",
+ "text": "Labs / Assignments\nThe goal is to “Do the work”\n\n\nAssignments\n\nNot easy, especially the first 2, especially if you are unfamiliar with R / Rmarkdown / ggplot\nYou may revise to raise your score to 7/10, see Syllabus. Only if you get lose 3+ for content (penalties can’t be redeemed).\nDon’t leave these for the last minute\n\n\nLabs\n\nLabs should give you practice, allow for questions with the TAs.\nThey are due at 2300 on the day of your lab, lightly graded.\nYou may do them at home, but you must submit individually (in lab, you may share submission)\nLabs are lightly graded"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#clickers",
+ "href": "schedule/slides/00-intro-to-class.html#clickers",
+ "title": "UBC Stat406 2023W",
+ "section": "Clickers",
+ "text": "Clickers\n\nQuestions are similar to the Final\n0 points for skipping, 2 points for trying, 4 points for correct\n\nAverage of 3 = 10 points (the max)\nAverage of 2 = 5 points\nAverage of 1 = 0 points\ntotal = max(0, min(5 * points / N - 5, 10))\n\nBe sure to sync your device in Canvas.\n\n\n\n\n\n\n\nDon’t do this!\n\n\nAverage < 1 drops your Final Mark 1 letter grade.\nA- becomes B-, C+ becomes D."
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#final-exam",
+ "href": "schedule/slides/00-intro-to-class.html#final-exam",
+ "title": "UBC Stat406 2023W",
+ "section": "Final Exam",
+ "text": "Final Exam\n\nScheduled by the university.\nIt is hard\nThe median last year was 50% \\(\\Rightarrow\\) A-\n\nPhilosophy:\n\nIf you put in the effort, you’re guaranteed a C+.\nBut to get an A+, you should really deeply understand the material.\n\nNo penalty for skipping the final.\nIf you’re cool with C+ and hate tests, then that’s fine."
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#advice",
+ "href": "schedule/slides/00-intro-to-class.html#advice",
+ "title": "UBC Stat406 2023W",
+ "section": "Advice",
+ "text": "Advice\n\nSkipping HW makes it difficult to get to 65\nCome to class!\nYes it’s at 8am. I hate it too.\nTo compensate, I will record the class and post to Canvas.\nIn terms of last year’s class, attendance in lecture and active engagement (asking questions, coming to office hours, etc.) is the best predictor of success.\n\n\nQuestions?"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#textbooks",
+ "href": "schedule/slides/00-intro-to-class.html#textbooks",
+ "title": "UBC Stat406 2023W",
+ "section": "Textbooks",
+ "text": "Textbooks\n\n\n\n\n\n\nAn Introduction to Statistical Learning\n\n\nJames, Witten, Hastie, Tibshirani, 2013, Springer, New York. (denoted [ISLR])\nAvailable free online: http://statlearning.com/\n\n\n\n\n\n\n\n\n\nThe Elements of Statistical Learning\n\n\nHastie, Tibshirani, Friedman, 2009, Second Edition, Springer, New York. (denoted [ESL])\nAlso available free online: https://web.stanford.edu/~hastie/ElemStatLearn/\n\n\n\n\nIt’s worth your time to read.\nIf you need more practice, read the Worksheets."
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#computer",
+ "href": "schedule/slides/00-intro-to-class.html#computer",
+ "title": "UBC Stat406 2023W",
+ "section": "Computer",
+ "text": "Computer\n\n\n\n\n\nAll coding in R\nSuggest you use RStudio IDE\nSee https://ubc-stat.github.io/stat-406/ for instructions\nIt tells you how to install what you will need, hopefully all at once, for the whole Term.\nWe will use R and we assume some background knowledge.\nLinks to useful supplementary resources are available on the website.\n\n\n\n\n\n\nThis course is not an intro to R / python / MongoDB / SQL."
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#other-resources",
+ "href": "schedule/slides/00-intro-to-class.html#other-resources",
+ "title": "UBC Stat406 2023W",
+ "section": "Other resources",
+ "text": "Other resources\n\nCanvas\n\nGrades, links to videos from class\n\nCourse website\n\nAll the material (slides, extra worksheets) https://ubc-stat.github.io/stat-406\n\nSlack\n\nDiscussion board, questions.\n\nGithub\n\nHomework / Lab submission\n\n\n\n\n\nAll lectures will be recorded and posted\nI cannot guarantee that they will all work properly (sometimes I mess it up)"
+ },
+ {
+ "objectID": "schedule/slides/00-intro-to-class.html#some-more-words",
+ "href": "schedule/slides/00-intro-to-class.html#some-more-words",
+ "title": "UBC Stat406 2023W",
+ "section": "Some more words",
+ "text": "Some more words\n\nLectures are hard. It’s 8am, everyone’s tired.\nCoding is hard. I hope you’ll get better at it.\nI strongly urge you to get up at the same time everyday. My plan is to go to the gym on MWF. It’s really hard to sleep in until 10 on MWF and make class at 8 on T/Th.\n\n\n\nLet’s be kind and understanding to each other.\nI have to give you a grade, but I want that grade to reflect your learning and effort, not other junk.\n\n\n\nIf you need help, please ask."
+ },
+ {
+ "objectID": "schedule/slides/00-cv-for-many-models.html#meta-lecture",
+ "href": "schedule/slides/00-cv-for-many-models.html#meta-lecture",
+ "title": "UBC Stat406 2023W",
+ "section": "00 CV for many models",
+ "text": "00 CV for many models\nStat 406\nDaniel J. McDonald\nLast modified – 19 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ },
+ {
+ "objectID": "schedule/slides/00-cv-for-many-models.html#some-data-and-4-models",
+ "href": "schedule/slides/00-cv-for-many-models.html#some-data-and-4-models",
+ "title": "UBC Stat406 2023W",
+ "section": "Some data and 4 models",
+ "text": "Some data and 4 models\n\ndata(\"mobility\", package = \"Stat406\")\n\nModel 1: Lasso on all predictors, use CV min\nModel 2: Ridge on all predictors, use CV min\nModel 3: OLS on all predictors (no tuning parameters)\nModel 4: (1) Lasso on all predictors, then (2) OLS on those chosen at CV min\n\nHow do I decide between these 4 models?"
+ },
+ {
+ "objectID": "schedule/slides/00-cv-for-many-models.html#cv-functions",
+ "href": "schedule/slides/00-cv-for-many-models.html#cv-functions",
+ "title": "UBC Stat406 2023W",
+ "section": "CV functions",
+ "text": "CV functions\n\nkfold_cv <- function(data, estimator, predictor, error_fun, kfolds = 5) {\n fold_labels <- sample(rep(seq_len(kfolds), length.out = nrow(data)))\n errors <- double(kfolds)\n for (fold in seq_len(kfolds)) {\n test_rows <- fold_labels == fold\n train <- data[!test_rows, ]\n test <- data[test_rows, ]\n current_model <- estimator(train)\n test$.preds <- predictor(current_model, test)\n errors[fold] <- error_fun(test)\n }\n mean(errors)\n}\n\nloo_cv <- function(dat) {\n mdl <- lm(Mobility ~ ., data = dat)\n mean( abs(residuals(mdl)) / abs(1 - hatvalues(mdl)) ) # MAE version\n}"
+ },
+ {
+ "objectID": "schedule/slides/00-cv-for-many-models.html#experiment-setup",
+ "href": "schedule/slides/00-cv-for-many-models.html#experiment-setup",
+ "title": "UBC Stat406 2023W",
+ "section": "Experiment setup",
+ "text": "Experiment setup\n\n# prepare our data\n# note that mob has only continuous predictors, otherwise could be trouble\nmob <- mobility[complete.cases(mobility), ] |> select(-ID, -State, -Name)\n# avoid doing this same operation a bunch\nxmat <- function(dat) dat |> select(!Mobility) |> as.matrix()\n\n# set up our model functions\nlibrary(glmnet)\nmod1 <- function(dat, ...) cv.glmnet(xmat(dat), dat$Mobility, type.measure = \"mae\", ...)\nmod2 <- function(dat, ...) cv.glmnet(xmat(dat), dat$Mobility, alpha = 0, type.measure = \"mae\", ...)\nmod3 <- function(dat, ...) glmnet(xmat(dat), dat$Mobility, lambda = 0, ...) # just does lm()\nmod4 <- function(dat, ...) cv.glmnet(xmat(dat), dat$Mobility, relax = TRUE, gamma = 1, type.measure = \"mae\", ...)\n\n# this will still \"work\" on mod3, because there's only 1 s\npredictor <- function(mod, dat) drop(predict(mod, newx = xmat(dat), s = \"lambda.min\"))\n\n# chose mean absolute error just 'cause\nerror_fun <- function(testdata) mean(abs(testdata$Mobility - testdata$.preds))"
+ },
+ {
+ "objectID": "schedule/slides/00-cv-for-many-models.html#run-the-experiment",
+ "href": "schedule/slides/00-cv-for-many-models.html#run-the-experiment",
+ "title": "UBC Stat406 2023W",
+ "section": "Run the experiment",
+ "text": "Run the experiment\n\nall_model_funs <- lst(mod1, mod2, mod3, mod4)\nall_fits <- map(all_model_funs, .f = exec, dat = mob)\n\n# unfortunately, does different splits for each method, so we use 10, \n# it would be better to use the _SAME_ splits\nten_fold_cv <- map_dbl(all_model_funs, ~ kfold_cv(mob, .x, predictor, error_fun, 10)) \n\nin_sample_cv <- c(\n mod1 = min(all_fits[[1]]$cvm),\n mod2 = min(all_fits[[2]]$cvm),\n mod3 = loo_cv(mob),\n mod4 = min(all_fits[[4]]$cvm)\n)\n\ntib <- bind_rows(in_sample_cv, ten_fold_cv)\ntib$method = c(\"in_sample\", \"out_of_sample\")\ntib\n\n# A tibble: 2 × 5\n mod1 mod2 mod3 mod4 method \n <dbl> <dbl> <dbl> <dbl> <chr> \n1 0.0159 0.0161 0.0164 0.0156 in_sample \n2 0.0158 0.0161 0.0165 0.0161 out_of_sample\n\n\n\n\nUBC Stat 406 - 2023"
+ },
+ {
+ "objectID": "schedule/index.html",
+ "href": "schedule/index.html",
+ "title": " Schedule",
+ "section": "",
+ "text": "Required readings and lecture videos are listed below for each module. Readings from [ISLR] are always required while those from [ESL] are optional and supplemental."
+ },
+ {
+ "objectID": "schedule/index.html#introduction-and-review",
+ "href": "schedule/index.html#introduction-and-review",
+ "title": " Schedule",
+ "section": "0 Introduction and Review",
+ "text": "0 Introduction and Review\nRequired reading below is meant to reengage brain cells which have no doubt forgotten all the material that was covered in STAT 306 or CPSC 340. We don’t presume that you remember all these details, but that, upon rereading, they at least sound familiar. If this all strikes you as completely foreign, this class may not be for you.\n\nRequired reading\n\n[ISLR] 2.1, 2.2, and Chapter 3 (this material is review)\n\nOptional reading\n\n[ESL] 2.4 and 2.6\n\nHandouts\n\nProgramming in R .Rmd, .pdf\n\n\nUsing in RMarkdown .Rmd, .pdf\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n05 Sep 23\n(no class, Imagine UBC)\n\n\n\n07 Sep 23\nIntro to class, Git\n(Quiz 0 due tomorrow)\n\n\n12 Sep 23\nUnderstanding R / Rmd\nLab 00, (Labs begin)\n\n\n14 Sep 23\nLM review, LM Example"
+ },
+ {
+ "objectID": "schedule/index.html#model-accuracy",
+ "href": "schedule/index.html#model-accuracy",
+ "title": " Schedule",
+ "section": "1 Model Accuracy",
+ "text": "1 Model Accuracy\n\nTopics\n\nModel selection; cross validation; information criteria; stepwise regression\n\nRequired reading\n\n[ISLR] Ch 2.2 (not 2.2.3), 5.1 (not 5.1.5), 6.1, 6.4\n\nOptional reading\n\n[ESL] 7.1-7.5, 7.10\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n19 Sep 23\nRegression function, Bias and Variance\n\n\n\n21 Sep 23\nRisk estimation, Info Criteria\n\n\n\n26 Sep 23\nGreedy selection\n\n\n\n28 Sep 23\n\nHW 1 due"
+ },
+ {
+ "objectID": "schedule/index.html#regularization-smoothing-and-trees",
+ "href": "schedule/index.html#regularization-smoothing-and-trees",
+ "title": " Schedule",
+ "section": "2 Regularization, smoothing, and trees",
+ "text": "2 Regularization, smoothing, and trees\n\nTopics\n\nRidge regression, lasso, and related; linear smoothers (splines, kernels); kNN\n\nRequired reading\n\n[ISLR] Ch 6.2, 7.1-7.7.1, 8.1, 8.1.1, 8.1.3, 8.1.4\n\nOptional reading\n\n[ESL] 3.4, 3.8, 5.4, 6.3\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n3 Oct 23\nRidge, Lasso\n\n\n\n5 Oct 23\nCV for comparison, NP 1\n\n\n\n10 Oct 23\nNP 2, Why smoothing?\n\n\n\n12 Oct 23\nNo class (Makeup Monday)\n\n\n\n17 Oct 23\nOther"
+ },
+ {
+ "objectID": "schedule/index.html#classification",
+ "href": "schedule/index.html#classification",
+ "title": " Schedule",
+ "section": "3 Classification",
+ "text": "3 Classification\n\nTopics\n\nlogistic regression; LDA/QDA; naive bayes; trees\n\nRequired reading\n\n[ISLR] Ch 2.2.3, 5.1.5, 4-4.5, 8.1.2\n\nOptional reading\n\n[ESL] 4-4.4, 9.2, 13.3\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n19 Oct 23\nClassification, LDA and QDA\n\n\n\n24 Oct 23\nLogistic regression\nHW 2 due\n\n\n26 Oct 23\nGradient descent, Other losses\n\n\n\n31 Oct 23\nNonlinear"
+ },
+ {
+ "objectID": "schedule/index.html#modern-techniques",
+ "href": "schedule/index.html#modern-techniques",
+ "title": " Schedule",
+ "section": "4 Modern techniques",
+ "text": "4 Modern techniques\n\nTopics\n\nbagging; boosting; random forests; neural networks\n\nRequired reading\n\n[ISLR] 5.2, 8.2, 10.1, 10.2, 10.6, 10.7\n\nOptional reading\n\n[ESL] 10.1-10.10 (skip 10.7), 11.1, 11.3, 11.4, 11.7\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n2 Nov 23\nThe bootstrap\n\n\n\n7 Nov 23\nBagging and random forests, Boosting\nHW 3 due\n\n\n9 Nov 23\nIntro to neural nets\n\n\n\n14 Nov 23\nNo class. (Midterm break)\n\n\n\n16 Nov 23\nEstimating neural nets\n\n\n\n21 Nov 23\nNeural nets wrapup\nHW 4 due"
+ },
+ {
+ "objectID": "schedule/index.html#unsupervised-learning",
+ "href": "schedule/index.html#unsupervised-learning",
+ "title": " Schedule",
+ "section": "5 Unsupervised learning",
+ "text": "5 Unsupervised learning\n\nTopics\n\ndimension reduction and clustering\n\nRequired reading\n\n[ISLR] 12\n\nOptional reading\n\n[ESL] 8.5, 13.2, 14.3, 14.5.1, 14.8, 14.9\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n23 Nov 23\nIntro to PCA, Issues with PCA\n\n\n\n28 Nov 23\nPCA v KPCA\n\n\n\n30 Nov 23\nK means clustering\n\n\n\n5 Dec 23\nHierarchical clustering\n\n\n\n7 Dec 23\n\nHW 5 due"
+ },
+ {
+ "objectID": "schedule/index.html#f-final-exam",
+ "href": "schedule/index.html#f-final-exam",
+ "title": " Schedule",
+ "section": "F Final exam",
+ "text": "F Final exam\nMonday, December 18 at 12-2pm, location TBA\n\n\nIn person attendance is required (per Faculty of Science guidelines)\nYou must bring your computer as the exam will be given through Canvas\nPlease arrange to borrow one from the library if you do not have your own. Let me know ASAP if this may pose a problem.\nYou may bring 2 sheets of front/back 8.5x11 paper with any notes you want to use. No other materials will be allowed.\nThere will be no required coding, but I may show code or output and ask questions about it.\nIt will be entirely multiple choice / True-False / matching, etc. Delivered on Canvas."
+ },
+ {
+ "objectID": "syllabus.html",
+ "href": "syllabus.html",
+ "title": " Syllabus",
"section": "",
- "text": "Version 2023\nThis guide (hopefully) gives enough instructions for recreating new iterations of Stat 406."
+ "text": "Term 2023 Winter 1: 05 Sep - 07 Dec 2023"
},
{
- "objectID": "course-setup.html#github-org",
- "href": "course-setup.html#github-org",
- "title": "Guide for setting up the course infrastructure",
- "section": "Github Org",
- "text": "Github Org\n\nCreate a GitHub.com organization\n\nThis is free for faculty with instructor credentials.\nAllows more comprehensive GitHub actions, PR templates and CODEOWNER behaviour than the UBC Enterprise version\nDownside is getting students added (though we include R scripts for this)\n\nOnce done, go to https://github.com/watching. Click the Red Down arrow “Unwatch all”. Then select this Org. The TAs should do the same.\n\n\nPermissions and structure\nSettings > Member Privileges\nWe list only the important ones.\n\nBase Permissions: No Permission\nRepository creation: None\nRepo forking: None\nPages creation: None\nTeam creation rules: No\n\nBe sure to click save in each area after making changes.\nSettings > Actions > General\nAll repositories: Allow all actions and reusable workflows.\nWorkflow permissions: Read and write permissions.\n\n\nTeams\n\n2 teams, one for the TAs and one for the students\nYou must then manually add the teams to any repos they should access\n\nI generally give the TAs “Write” permission, and the students “Read” permission with some exceptions. See the Repos section below."
+ "objectID": "syllabus.html#course-info",
+ "href": "syllabus.html#course-info",
+ "title": " Syllabus",
+ "section": "Course info",
+ "text": "Course info\nInstructor:\nDaniel McDonald\nOffice: Earth Sciences Building 3106\nWebsite: https://dajmcdon.github.io/\nEmail: daniel@stat.ubc.ca\nSlack: @prof-daniel\nOffice hours:\nMonday (TA), 2-3pm ESB 1045\nThursday/Tuesday (DJM), 10-11am ESB 4182 (the first Tuesday of each month will be moved to Thursday)\nThursday (TA), 3-4pm ESB 3174\nFriday (TA/DJM), 10-11am Zoom (link on Canvas)\nCourse webpage:\nWWW: https://ubc-stat.github.io/stat-406/\nGithub: https://github.com/stat-406-2023\nSee also Canvas\nLectures:\nTue/Thu 0800h - 0930h\n(In person) Earth Sciences Building (ESB) 1012\nTextbooks:\n[ISLR]\n[ESL]\nPrerequisite:\nSTAT 306 or CPSC 340"
},
{
- "objectID": "course-setup.html#repos",
- "href": "course-setup.html#repos",
- "title": "Guide for setting up the course infrastructure",
- "section": "Repos",
- "text": "Repos\nThere are typically about 10 repositories. Homeworks and Labs each have 3 with very similar behaviours.\nBe careful copying directories. All of them have hidden files and folders, e.g. .git. Of particular importance are the .github directories which contain PR templates and GitHub Actions. Also relevant are the .Rprofile files which try to override Student Language settings and avoid unprintible markdown characters.\n\nHomeworks\n\nhomework-solutions\nThis is where most of the work happens. My practice is to create the homework solutions first. I edit these (before school starts) until I’m happy. I then duplicate the file and remove the answers. The result is hwxx-instructions.Rmd. The .gitignore file should ignore all of the solutions and commmit only the instructions. Then, about 1 week after the deadline, I adjust the .gitignore and push the solution files.\n\nStudents have Read permission.\nTAs have Write permission.\nThe preamble.tex file is common to HWs and Labs. It creates a lavender box where the solution will go. This makes life easy for the TAs.\n\n\n\nhomework-solutions-private\nExactly the same as homework-solutions except that all solutions are available from the beginning for TA access. To create this, after I’m satisfied with homework-solutions I copy all files (not the directory) into a new directory, git init then upload to the org. The students never have permission here.\n\n\nhomework-template\nThis is a “template repo” used for creating student specific homework-studentgh repos (using the setup scripts).\nVery Important: copy the hwxx-instructions files over to a new directory. Do NOT copy the directory or you’ll end up with the solutions visible to the students.\nThen rename hwxx-instructions.Rmd to hwxx.Rmd. Now the students have a .pdf with instructions, and a template .Rmd to work on.\nOther important tasks: * The .gitignore is more elaborate in an attempt to avoid students pushing junk into these repos. * The .github directory contains 3 files: CODEOWNERS begins as an empty doc which will be populated with the assigned grader later; pull_request_template.md is used for all HW submission PRs; workflows contains a GH-action to comment on the PR with the date+time when the PR is opened. * Under Settings > General, select “Template repository”. This makes it easier to duplicate to the student repos.\n\n\n\nLabs\nThe three Labs repos operate exactly as the analogous homework repos.\n\nlabs-solutions\nDo any edits here before class begins.\n\n\nlabs-solutions-private\nSame as with the homeworks\n\n\nlabs-template\nSame as with the homeworks\n\n\n\nclicker-solutions\nThis contains the complete set of clicker questions.\nAnswers are hidden in comments on the presentation.\nI release them incrementally after each module (copying over from my clicker deck).\n\n\nopen-pr-log\nThis contains a some GitHub actions to automatically keep track of open PRs for the TAs.\nIt’s still in testing phase, but should work properly. It will create two markdown docs, 1 for labs and 1 for homework. Each shows the assigned TA, the date the PR was opened, and a link to the PR. If everything is configured properly, it should run automatically at 3am every night.\n\nOnly the TAs should have access.\nUnder Settings > Secrets and Variables > Actions you must add a “Repository Secret”. This should be a GitHub Personal Access Token created in your account (Settings > Developer settings > Tokens (classic)). It needs Repo, Workflow, and Admin:Org permissions. 
I set it to expire at the end of the course. I use it only for this purpose (rather than my other tokens for typical logins).\n\n\n\n.github / .github-private\nThese contains a README that gives some basic information about the available repos and the course. It’s visible Publically, and appears on the Org homepage for all to see. The .github-private has the same function, but applies only to Org members.\n\n\nbakeoff-bakeoff\nThis is for the bonus for HW4. Both TAs and Students have access. I put the TA team as CODEOWNERS and protect the main branch (Settings > Branches > Branch Protection Rules). Here, we “Require approvals” and “Require Review from Code Owners”."
+ "objectID": "syllabus.html#course-objectives",
+ "href": "syllabus.html#course-objectives",
+ "title": " Syllabus",
+ "section": "Course objectives",
+ "text": "Course objectives\nThis is a course in statistical learning methods. Based on the theory of linear models covered in Stat 306, this course will focus on applying many techniques of data analysis to interesting datasets.\nThe course combines analysis with methodology and computational aspects. It treats both the “art” of understanding unfamiliar data and the “science” of analyzing that data in terms of statistical properties. The focus will be on practical aspects of methodology and intuition to help students develop tools for selecting appropriate methods and approaches to problems in their own lives.\nThis is not a “how to program” course, nor a “tour of machine learning methods”. Rather, this course is about how to understand some ML methods. STAT 306 tends to give background in many of the tools of understanding as well as working with already-written R packages. On the other hand, CPSC 340 introduces many methods with a focus on “from-scratch” implementation (in Julia or Python). This course will try to bridge the gap between these approaches. Depending on which course you took, you may be more or less skilled in some aspects than in others. That’s OK and expected.\n\nLearning outcomes\n\nAssess the prediction properties of the supervised learning methods covered in class;\nCorrectly use regularization to improve predictions from linear models, and also to identify important explanatory variables;\nExplain the practical difference between predictions obtained with parametric and non-parametric methods, and decide in specific applications which approach should be used;\nSelect and construct appropriate ensembles to obtain improved predictions in different contexts;\nUse and interpret principal components and other dimension reduction techniques;\nEmploy reasonable coding practices and understand basic R syntax and function.\nWrite reports and use proper version control; engage with standard software."
},
{
- "objectID": "course-setup.html#r-package",
- "href": "course-setup.html#r-package",
- "title": "Guide for setting up the course infrastructure",
- "section": "R package",
- "text": "R package\nThis is hosted at https://github.com/ubc-stat/stat-406-rpackage/. The main purposes are:\n\nDocumentation of datasets used in class, homework, and labs (if not in other R packages)\nProvide a few useful functions.\nInstall all the packages the students need at once, and try to compile LaTeX.\n\nPackage requirements are done manually, unfortunately. Typically, I’ll open the various projects in RStudio and run sort(unique(renv::dependencies()$Package)). It’s not infallible, but works well.\nAll necessary packages should go in “Suggests:” in the DESCRIPTION. This avoids build errors. Note that install via remotes::install_github() then requires dependencies = TRUE."
+ "objectID": "syllabus.html#textbooks",
+ "href": "syllabus.html#textbooks",
+ "title": " Syllabus",
+ "section": "Textbooks",
+ "text": "Textbooks\n\nRequired:\nAn Introduction to Statistical Learning, James, Witten, Hastie, Tibshirani, 2013, Springer, New York. (denoted [ISLR])\nAvailable free online: https://www.statlearning.com\n\n\nOptional (but excellent):\nThe Elements of Statistical Learning, Hastie, Tibshirani, Friedman, 2009, Second Edition, Springer, New York. (denoted [ESL])\nAlso available free online: https://web.stanford.edu/~hastie/ElemStatLearn/\nThis second book is a more advanced treatment of a superset of the topics we will cover. If you want to learn more and understand the material more deeply, this is the book for you. All readings from [ESL] are optional."
},
{
- "objectID": "course-setup.html#worksheets",
- "href": "course-setup.html#worksheets",
- "title": "Guide for setting up the course infrastructure",
- "section": "Worksheets",
- "text": "Worksheets\nThese are derived from Matías’s Rmd notes from 2018. They haven’t been updated much.\nThey are hosted at https://github.com/ubc-stat/stat-406-worksheets/.\nI tried requiring them one year. The model was to distribute the R code for the chapters with some random lines removed. Then the students could submit the completed code for small amounts of credit. It didn’t seem to move the needle much and was hard to grade (autograding would be nice here).\nNote that there is a GHaction that automatically renders the book from source and pushes to the gh-pages branch. So local build isn’t necessary and derivative files should not be checked in to version control."
+ "objectID": "syllabus.html#course-assessment-opportunities",
+ "href": "syllabus.html#course-assessment-opportunities",
+ "title": " Syllabus",
+ "section": "Course assessment opportunities",
+ "text": "Course assessment opportunities\n\nEffort-based component\nLabs: [0, 20]\nHomework assignments: [0, 50]\nClickers: [0, 10]\nTotal: min(65, Labs + Homework + Clickers)\n\n\nLabs\nThese are intended to keep you on track. They are to be submitted via pull requests in your personal labs-<username> repo (see the computing tab for descriptions on how to do this).\nLabs typically have a few questions for you to answer or code to implement. These are to be done during lab periods. But you can do them on your own as well. These are worth 2 points each up to a maximum of 20 points. They are due at 2300 on the day of your assigned lab section.\nIf you attend lab, you may share a submission with another student (with acknowledgement on the PR). If you do not attend lab, you must work on your own (subject to the collaboration instructions for Assignments below).\n\nRules.\nYou must submit via PR by the deadline. Your PR must include at least 3 commits. After lab 2, failure to include at least 3 commits will result in a maximum score of 1.\n\n\n\n\n\n\nTip\n\n\n\nIf you attend your lab section, you may work in pairs, submitting a single document to one of your Repos. Be sure to put both names on the document, and mention the collaboration on your PR. You still have until 11pm to submit.\n\n\n\n\nMarking.\nThe overriding theme here is “if you put in the effort, you’ll get all the points.” Grading scheme:\n\n2 if basically all correct\n\n1 if complete but with some major errors, or mostly complete and mostly correct\n\n0 otherwise\n\nYou may submit as many labs as you wish up to 20 total points.\nThere are no appeals on grades.\nIt’s important here to recognize just how important active participation in these activities is. You learn by doing, and this is your opportunity to learn in a low-stakes environment. One thing you’ll learn, for example, is that all animals urinate in 21 seconds.1\n\n\n\nAssignments\nThere will be 5 assignments. These are submitted via pull request similar to the labs but to the homework-<username> repo. Each assignment is worth up to 10 points. They are due by 2300 on the deadline. You must make at least 5 commits. Failure to have at least 5 commits will result in a 25% deduction on HW1 and a 50% deduction thereafter. No exceptions.\nAssignments are typically lightly marked. The median last year was 8/10. But they are not easy. Nor are they short. They often involve a combination of coding, writing, description, and production of statistical graphics.\nAfter receiving a mark and feedback, if you score less than 7, you may make corrections to bring your total to 7. This means, if you fix everything that you did wrong, you get 7. Not 10. The revision must be submitted within 1 week of getting your mark. Only 1 revision per assignment. The TA decision is final. Note that the TAs will only regrade parts you missed, but if you somehow make it worse, they can deduct more points.\nThe revision allowance applies only if you got 3 or more points of “content” deductions. If you missed 3 points for content and 2 more for “penalties” (like insufficient commits, code that runs off the side of the page, etc), then you are ineligible.\n\nPolicy on collaboration on assignments\nDiscussing assignments with your classmates is allowed and encouraged, but it is important that every student get practice working on these problems. This means that all the work you turn in must be your own. 
The general policy on homework collaboration is:\n\nYou must first make a serious effort to solve the problem.\nIf you are stuck after doing so, you may ask for help from another student. You may discuss strategies to solve the problem, but you may not look at their code, nor may they spell out the solution to you step-by-step.\nOnce you have gotten help, you must write your own solution individually. You must disclose, in your GitHub pull request, the names of anyone from whom you got help.\nThis also applies in reverse: if someone approaches you for help, you must not provide it unless they have already attempted to solve the problem, and you may not share your code or spell out the solution step-by-step.\n\n\n\n\n\n\n\nWarning\n\n\n\nAdherence to the above policy means that identical answers, or nearly identical answers, cannot occur. Thus, such occurrences are violations of the Course’s Academic honesty policy.\n\n\nThese rules also apply to getting help from other people such as friends not in the course (try the problem first, discuss strategies, not step-by-step solutions, acknowledge those from whom you received help).\nYou may not use homework help websites, ChatGPT, Stack Overflow, and so on under any circumstances. The purpose here is to learn. Good faith efforts toward learning are rewarded.\nYou can always, of course, ask me for help on Slack. And public Slack questions are allowed and encouraged.\nYou may also use external sources (books, websites, papers, …) to\n\nLook up programming language documentation, find useful packages, find explanations for error messages, or remind yourself about the syntax for some feature. I do this all the time in the real world. Wikipedia is your friend.\nRead about general approaches to solving specific problems (e.g. a guide to dynamic programming or a tutorial on unit testing in your programming language), or\nClarify material from the course notes or assignments.\n\nBut external sources must be used to support your solution, not to obtain your solution. You may not use them to\n\nFind solutions to the specific problems assigned as homework (in words or in code)—you must independently solve the problem assigned, not translate a solution presented online or elsewhere.\nFind course materials or solutions from this or similar courses from previous years, or\nCopy text or code to use in your submissions without attribution.\n\nIf you use code from online or other sources, you must include code comments identifying the source. It must be clear what code you wrote and what code is from other sources. This rule also applies to text, images, and any other material you submit.\nPlease talk to me if you have any questions about this policy. Any form of plagiarism or cheating will result in sanctions to be determined by me, including grade penalties (such as negative points for the assignment or reductions in letter grade) or course failure. I am obliged to report violations to the appropriate University authorities. See also the text below.\n\n\n\nClickers\nThese are short multiple choice and True / False questions. They happen in class. For each question, correct answers are worth 4, incorrect answers are worth 2. You get 0 points for not answering.\nSuppose there are N total clicker questions, and you have x points. 
Your final score for this component is\nmax(0, min(5 * x / N - 5, 10)) (see the short R sketch below).\nNote that if your average is less than 1, you get 0 points in this component.\n\n\n\n\n\n\nImportant\n\n\n\nIn addition, your final grade in this course will be reduced by 1 full letter grade.\n\n\nThis means that if you did everything else and get a perfect score on the final exam, you will get a 79. Two people did this last year. They were sad.\n\n\n\n\n\n\nWarning\n\n\n\nDON’T DO THIS!!\n\n\nThis may sound harsh, but think about what is required for such a penalty. You’d have to skip more than 50% of class meetings and get every question wrong when you are in class. This is an in-person course. It is not possible to get an A without attending class on a regular basis.\nTo compensate, I will do my best to post recordings of lectures. Past experience has shown 2 things:\n\nYou learn better by attending class than by skipping and “watching”.\nSometimes the technology messes up. So there’s no guarantee that these will be available.\n\nThe purpose is to let you occasionally miss class for any reason with minimal consequences. See also below. If for some reason you need to miss longer stretches of time, please contact me or discuss your situation with your Academic Advisor as soon as possible. Don’t wait until December.\n\n\n\nYour score on HW, Labs, and Clickers\nThe total you can accumulate across these 3 components is 65 points. But you can get there however you want. The total available is 80 points. The rest is up to you. But with choice comes responsibility.\nRules:\n\nNothing dropped.\nNo extensions.\nIf you miss a lab or a HW deadline, then you miss it.\nMake up for missed work somewhere else.\nIf you isolate due to Covid, fine. You miss a few clickers and maybe a lab (though you can do it remotely).\nIf you have a job interview and can’t complete an assignment on time, then skip it.\n\nWe’re not going to police this stuff. You don’t need to let me know. There is no reason that every single person enrolled in this course shouldn’t get > 65 in this class.\nIllustrative scenarios:\n\nDoing 80% on 5 homeworks, coming to class and getting 50% correct, getting 2 points on 8 labs gets you 65 points.\nDoing 90% on 5 homeworks, getting 50% correct on all the clickers, averaging 1/2 on all the labs gets you 65 points.\nGoing to all the labs and getting 100%, 100% on 4 homeworks, plus being wrong on every clicker gets you 65 points.\n\nChoose your own adventure. Note that the biggest barrier to getting to 65 is skipping the assignments.\n\n\n\n\nFinal exam\n35 points\n\n\nAll multiple choice, T/F, matching.\nThe clickers are the best preparation.\nQuestions may ask you to understand or find mistakes in code.\nNo writing code.\n\nThe Final is very hard. By definition, it cannot be effort-based.\nIt is intended to separate those who really understand the material from those who don’t. Last year, the median was 50%. But if you put in the work (do all the effort points) and get 50%, you get an 83 (an A-). If you put in the work (do all the effort points) and skip the final, you get a 65. You do not have to pass the final to pass the course. You don’t even have to take the final.\nThe point of this scheme is for those who work hard to do well. But only those who really understand the material will get 90+."
},
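The clicker formula in the entry above is easy to check numerically. Here is a minimal R sketch, assuming the scoring exactly as stated in the syllabus; the helper name clicker_score is hypothetical and not part of the course package:

```r
# Hypothetical helper implementing the syllabus formula max(0, min(5 * x / N - 5, 10)),
# where x is your total clicker points and N is the number of clicker questions asked.
clicker_score <- function(x, N) {
  pmax(0, pmin(5 * x / N - 5, 10))
}

clicker_score(x = 3.0 * 40, N = 40)  # averaging 3 points per question gives the full 10
clicker_score(x = 0.5 * 40, N = 40)  # averaging less than 1 point per question gives 0
```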
{
- "objectID": "course-setup.html#course-website-lectures",
- "href": "course-setup.html#course-website-lectures",
- "title": "Guide for setting up the course infrastructure",
- "section": "Course website / lectures",
- "text": "Course website / lectures"
+ "objectID": "syllabus.html#health-issues-and-considerations",
+ "href": "syllabus.html#health-issues-and-considerations",
+ "title": " Syllabus",
+ "section": "Health issues and considerations",
+ "text": "Health issues and considerations\n\nCovid Safety in the Classroom\n\n\n\n\n\n\nImportant\n\n\n\nIf you think you’re sick, stay home no matter what.\n\n\nMasks. Masks are recommended. For our in-person meetings in this class, it is important that all of us feel as comfortable as possible engaging in class activities while sharing an indoor space. Masks are a primary tool to make it harder for Covid-19 to find a new host. Please feel free to wear one or not given your own personal circumstances. Note that there are some people who cannot wear a mask. These individuals are equally welcome in our class.\nVaccination. If you have not yet had a chance to get vaccinated against Covid-19, vaccines are available to you, free. See http://www.vch.ca/covid-19/covid-19-vaccine for help finding an appointment. Boosters will be available later this term. The higher the rate of vaccination in our community overall, the lower the chance of spreading this virus. You are an important part of the UBC community. Please arrange to get vaccinated if you have not already done so. The same goes for Flu.\n\n\nYour personal health\n\n\n\n\n\n\nWarning\n\n\n\nIf you are sick, it’s important that you stay home – no matter what you think you may be sick with (e.g., cold, flu, other).\n\n\n\nDo not come to class if you have Covid symptoms, have recently tested positive for Covid, or are required to quarantine. You can check this website to find out if you should self-isolate or self-monitor: http://www.bccdc.ca/health-info/diseases-conditions/covid-19/self-isolation#Who.\nYour precautions will help reduce risk and keep everyone safer. In this class, the marking scheme is intended to provide flexibility so that you can prioritize your health and still be able to succeed. All work can be completed outside of class with reasonable time allowances.\nIf you do miss class because of illness:\n\nMake a connection early in the term to another student or a group of students in the class. You can help each other by sharing notes. If you don’t yet know anyone in the class, post on the discussion forum to connect with other students.\nConsult the class resources on here and on Canvas. We will post all the slides, readings, and recordings for each class day.\nUse Slack for help.\nCome to virtual office hours.\nSee the marking scheme for reassurance about what flexibility you have. No part of your final grade will be directly impacted by missing class.\n\nIf you are sick on final exam day, do not attend the exam. You must follow up with your home faculty’s advising office to apply for deferred standing. Students who are granted deferred standing write the final exam at a later date. If you’re a Science student, you must apply for deferred standing (an academic concession) through Science Advising no later than 48 hours after the missed final exam/assignment. Learn more and find the application online. For additional information about academic concessions, see the UBC policy here.\n\n\n\n\n\n\n\nNote\n\n\n\nPlease talk with me if you have any concerns or ask me if you are worried about falling behind."
},
{
- "objectID": "course-setup.html#ghclass-package",
- "href": "course-setup.html#ghclass-package",
- "title": "Guide for setting up the course infrastructure",
- "section": "{ghclass} package",
- "text": "{ghclass} package"
+ "objectID": "syllabus.html#university-policies",
+ "href": "syllabus.html#university-policies",
+ "title": " Syllabus",
+ "section": "University policies",
+ "text": "University policies\nUBC provides resources to support student learning and to maintain healthy lifestyles but recognizes that sometimes crises arise and so there are additional resources to access including those for survivors of sexual violence. UBC values respect for the person and ideas of all members of the academic community. Harassment and discrimination are not tolerated nor is suppression of academic freedom. UBC provides appropriate accommodation for students with disabilities and for religious, spiritual and cultural observances. UBC values academic honesty and students are expected to acknowledge the ideas generated by others and to uphold the highest academic standards in all of their actions. Details of the policies and how to access support are available here.\n\nAcademic honesty and standards\nUBC Vancouver Statement\nAcademic honesty is essential to the continued functioning of the University of British Columbia as an institution of higher learning and research. All UBC students are expected to behave as honest and responsible members of an academic community. Breach of those expectations or failure to follow the appropriate policies, principles, rules, and guidelines of the University with respect to academic honesty may result in disciplinary action.\nFor the full statement, please see the 2022/23 Vancouver Academic Calendar\nCourse specific\nSeveral commercial services have approached students regarding selling class notes/study guides to their classmates. Please be advised that selling a faculty member’s notes/study guides individually or on behalf of one of these services using UBC email or Canvas, violates both UBC information technology and UBC intellectual property policy. Selling the faculty member’s notes/study guides to fellow students in this course is not permitted. Violations of this policy will be considered violations of UBC Academic Honesty and Standards and will be reported to the Dean of Science as a violation of course rules. Sanctions for academic misconduct may include a failing grade on the assignment for which the notes/study guides are being sold, a reduction in your final course grade, a failing grade in the course, among other possibilities. Similarly, contracting with any service that results in an individual other than the enrolled student providing assistance on quizzes or exams or posing as an enrolled student is considered a violation of UBC’s academic honesty standards.\nSome of the problems that are assigned are similar or identical to those assigned in previous years by me or other instructors for this or other courses. Using proofs or code from anywhere other than the textbooks, this year’s course notes, or the course website is not only considered cheating (as described above), it is easily detectable cheating. Such behavior is strictly forbidden.\nIn previous years, I have caught students cheating on the exams or assignments. I did not enforce any penalty because the action did not help. Cheating, in my experience, occurs because students don’t understand the material, so the result is usually a failing grade even before I impose any penalty and report the incident to the Dean’s office. I carefully structure exams and assignments to make it so that I can catch these issues. I will catch you, and it does not help. Do your own work, and use the TAs and me as resources. If you are struggling, we are here to help.\n\n\n\n\n\n\nCaution\n\n\n\nIf I suspect cheating, your case will be forwarded to the Dean’s office. 
No questions asked.\n\n\nGenerative AI\nTools to help you code more quickly are rapidly becoming more prevalent. I use them regularly myself. The point of this course is not to “complete assignments” but to learn coding (and other things). With that goal in mind, I recommend you avoid the use of Generative AI. It is unlikely to contribute directly to your understanding of the material. Furthermore, I have experimented with certain tools on the assignments for this course and have found the results underwhelming.\nThe material in this course is best learned through trial and error. Avoiding this mechanism (with generative AI or by copying your friend) is a short-term solution at best. I have tried to structure this course to discourage these types of short cuts, and minimize the pressure you may feel to take them.\n\n\nAcademic Concessions\nThese are handled according to UBC policy. Please see\n\nUBC student services\nUBC Vancouver Academic Calendar\nFaculty of Science Concessions\n\n\n\nMissed final exam\nStudents who miss the final exam must report to their Faculty advising office within 72 hours of the missed exam, and must supply supporting documentation. Only your Faculty Advising office can grant deferred standing in a course. You must also notify your instructor prior to (if possible) or immediately after the exam. Your instructor will let you know when you are expected to write your deferred exam. Deferred exams will ONLY be provided to students who have applied for and received deferred standing from their Faculty.\n\n\nTake care of yourself\nCourse work at this level can be intense, and I encourage you to take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, avoiding drugs and alcohol, getting enough sleep and taking some time to relax. This will help you achieve your goals and cope with stress. I struggle with these issues too, and I try hard to set aside time for things that make me happy (cooking, playing/listening to music, exercise, going for walks).\nAll of us benefit from support during times of struggle. If you are having any problems or concerns, do not hesitate to speak with me. There are also many resources available on campus that can provide help and support. Asking for support sooner rather than later is almost always a good idea.\nIf you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, I strongly encourage you to seek support. UBC Counseling Services is here to help: call 604 822 3811 or visit their website. Consider also reaching out to a friend, faculty member, or family member you trust to help get you the support you need.\n\nA dated PDF is available at this link."
},
{
- "objectID": "course-setup.html#canvas",
- "href": "course-setup.html#canvas",
- "title": "Guide for setting up the course infrastructure",
- "section": "Canvas",
- "text": "Canvas\nI use a the shell provided by FoS.\nNothing else goes here, but you have to update all the links.\nTwo Canvas Quizzes: * Quiz 0 collects GitHub accounts, ensures that students read the syllabus. Due in Week 1. * Final Exam is the final * I usually record lectures (automatically) using the classroom tech that automatically uploads. * Update the various links on the Homepage."
+ "objectID": "syllabus.html#footnotes",
+ "href": "syllabus.html#footnotes",
+ "title": " Syllabus",
+ "section": "Footnotes",
+ "text": "Footnotes\n\n\nA careful reading of this paper with the provocative title “Law of Urination: all mammals empty their bladders over the same duration” reveals that the authors actually mean something far less precise. In fact, their claim is more accurately stated as “mammals over 3kg in body weight urinate in 21 seconds with a standard deviation of 13 seconds”. But the accurate characterization is far less publicity-worthy.↩︎"
},
{
- "objectID": "course-setup.html#slack",
- "href": "course-setup.html#slack",
- "title": "Guide for setting up the course infrastructure",
- "section": "Slack",
- "text": "Slack\n\nSet up a free Org. Invite link gets posted to Canvas.\nI add @students.ubc.ca, @ubc.ca, @stat.ubc.ca to the whitelist.\nI also post the invite on Canvas.\nCreate channels before people join. That way you can automatically add everyone to channels all at once. I do one for each module, 1 for code/github, 1 for mechanics. + 1 for the TAs (private)\nClick through all the settings. It’s useful to adjust these a bit."
+ "objectID": "computing/ubuntu.html",
+ "href": "computing/ubuntu.html",
+ "title": " Ubuntu",
+ "section": "",
+ "text": "If you have already installed Git, LaTeX, or any of the R packages, you should be OK. However, if you have difficulty with Homework or Labs, we may ask you to uninstall and try again.\nIn order to be able to support you effectively and minimize setup issues and software conflicts, we suggest you install the required software as specified below."
},
{
- "objectID": "course-setup.html#clickers",
- "href": "course-setup.html#clickers",
- "title": "Guide for setting up the course infrastructure",
- "section": "Clickers",
- "text": "Clickers\nSee https://lthub.ubc.ca/guides/iclicker-cloud-instructor-guide/\nI only use “Polling” no “Quizzing” and no “Attendance”\n\nIn clicker Settings > Polling > Sharing. Turn off the Sending (to avoid students doing it at home)\nNo participation points.\n2 points for correct, 2 for answering.\nIntegrations > Set this up with Canvas. Sync the roster. You’ll likely have to repeat this near the Add/Drop Deadline.\nI only sync the total, since I’ll recalibrate later."
+ "objectID": "computing/ubuntu.html#installation-notes",
+ "href": "computing/ubuntu.html#installation-notes",
+ "title": " Ubuntu",
+ "section": "",
+ "text": "If you have already installed Git, LaTeX, or any of the R packages, you should be OK. However, if you have difficulty with Homework or Labs, we may ask you to uninstall and try again.\nIn order to be able to support you effectively and minimize setup issues and software conflicts, we suggest you install the required software as specified below."
},
{
- "objectID": "computing/windows.html",
- "href": "computing/windows.html",
- "title": " Windows",
+ "objectID": "computing/ubuntu.html#ubuntu-software-settings",
+ "href": "computing/ubuntu.html#ubuntu-software-settings",
+ "title": " Ubuntu",
+ "section": "Ubuntu software settings",
+ "text": "Ubuntu software settings\nTo ensure that you are installing the right version of the software in this guide, open “Software & Updates” and make sure that the boxes in the screenshot are checked (this is the default configuration)."
+ },
+ {
+ "objectID": "computing/ubuntu.html#github",
+ "href": "computing/ubuntu.html#github",
+ "title": " Ubuntu",
+ "section": "GitHub",
+ "text": "GitHub\nIn Stat 406 we will use the publicly available GitHub.com. If you do not already have an account, please sign up for one at GitHub.com\nSign up for a free account at GitHub.com if you don’t have one already."
+ },
+ {
+ "objectID": "computing/ubuntu.html#git",
+ "href": "computing/ubuntu.html#git",
+ "title": " Ubuntu",
+ "section": "Git",
+ "text": "Git\nWe will be using the command line version of Git as well as Git through RStudio. Some of the Git commands we will use are only available since Git 2.23, so if your Git is older than this version, so if your Git is older than this version, we ask you to update it using the following commands:\nsudo apt update\nsudo apt install git\nYou can check your git version with the following command:\ngit --version\n\n\n\n\n\n\nNote\n\n\n\nIf you run into trouble, please see the Install Git Linux section from Happy Git and GitHub for the useR for additional help or strategies for Git installation.\n\n\n\nConfiguring Git user info\nNext, we need to configure Git by telling it your name and email. To do this, type the following into the terminal (replacing Jane Doe and janedoe@example.com, with your name and email that you used to sign up for GitHub, respectively):\ngit config --global user.name \"Jane Doe\"\ngit config --global user.email janedoe@example.com\n\n\n\n\n\n\nNote\n\n\n\nTo ensure that you haven’t made a typo in any of the above, you can view your global Git configurations by either opening the configuration file in a text editor (e.g. via the command nano ~/.gitconfig) or by typing git config --list --global).\n\n\nIf you have never used Git before, we recommend also setting the default editor:\ngit config --global core.editor nano\nIf you prefer VScode (and know how to set it up) or something else, feel free."
+ },
+ {
+ "objectID": "computing/ubuntu.html#latex",
+ "href": "computing/ubuntu.html#latex",
+ "title": " Ubuntu",
+ "section": "LaTeX",
+ "text": "LaTeX\nIt is possible you already have this installed.\nFirst try the following check in RStudio\nStat406::test_latex_installation()\nIf you see Green checkmarks, then you’re good.\nEven if it fails, follow the instructions, and try it again.\nIf it still fails, proceed with the instructions\n\nWe will install the lightest possible version of LaTeX and its necessary packages as possible so that we can render Jupyter notebooks and R Markdown documents to html and PDF. If you have previously installed LaTeX, please uninstall it before proceeding with these instructions.\nFirst, run the following command to make sure that /usr/local/bin is writable:\nsudo chown -R $(whoami):admin /usr/local/bin\n\n\n\n\n\n\nNote\n\n\n\nYou might be asked to enter your password during installation.\n\n\nNow open RStudio and run the following commands to install the tinytex package and setup tinytex:\ntinytex::install_tinytex()\nYou can check that the installation is working by opening a terminal and asking for the version of latex:\nlatex --version\nYou should see something like this if you were successful:\npdfTeX 3.141592653-2.6-1.40.23 (TeX Live 2022/dev)\nkpathsea version 6.3.4/dev\nCopyright 2021 Han The Thanh (pdfTeX) et al.\nThere is NO warranty. Redistribution of this software is\ncovered by the terms of both the pdfTeX copyright and\nthe Lesser GNU General Public License.\nFor more information about these matters, see the file\nnamed COPYING and the pdfTeX source.\nPrimary author of pdfTeX: Han The Thanh (pdfTeX) et al.\nCompiled with libpng 1.6.37; using libpng 1.6.37\nCompiled with zlib 1.2.11; using zlib 1.2.11\nCompiled with xpdf version 4.03"
+ },
+ {
+ "objectID": "computing/ubuntu.html#github-pat",
+ "href": "computing/ubuntu.html#github-pat",
+ "title": " Ubuntu",
+ "section": "Github PAT",
+ "text": "Github PAT\nYou’re probably familiar with 2-factor authentication for your UBC account or other accounts which is a very secure way to protect sensitive information (in case your password gets exposed). Github uses a Personal Access Token (PAT) for the Command Line Interface (CLI) and RStudio. This is different from the password you use to log in with a web browser. You will have to create one. There are some nice R functions that will help you along, and I find that easiest.\nComplete instructions are in Chapter 9 of Happy Git With R. Here’s the quick version (you need the usethis and gitcreds libraries, which you can install with install.packages(c(\"usethis\", \"gitcreds\"))):\n\nIn the RStudio Console, call usethis::create_github_token() This should open a webbrowser. In the Note field, write what you like, perhaps “Stat 406 token”. Then update the Expiration to any date after December 15. (“No expiration” is fine, though not very secure). Make sure that everything in repo is checked. Leave all other checks as is. Scroll to the bottom and click the green “Generate Token” button.\nThis should now give you a long string to Copy. It often looks like ghp_0asfjhlasdfhlkasjdfhlksajdhf9234u. Copy that. (You would use this instead of the browser password in RStudio when it asks for a password).\nTo store the PAT permanently in R (so you’ll never have to do this again, hopefully) call gitcreds::gitcreds_set() and paste the thing you copied there."
+ },
+ {
+ "objectID": "computing/ubuntu.html#post-installation-notes",
+ "href": "computing/ubuntu.html#post-installation-notes",
+ "title": " Ubuntu",
+ "section": "Post-installation notes",
+ "text": "Post-installation notes\nYou have completed the installation instructions, well done 🙌!"
+ },
+ {
+ "objectID": "computing/ubuntu.html#attributions",
+ "href": "computing/ubuntu.html#attributions",
+ "title": " Ubuntu",
+ "section": "Attributions",
+ "text": "Attributions\nThe DSCI 310 Teaching Team, notably, Anmol Jawandha, Tomas Beuzen, Rodolfo Lourenzutti, Joel Ostblom, Arman Seyed-Ahmadi, Florencia D’Andrea, and Tiffany Timbers."
+ },
+ {
+ "objectID": "computing/mac_arm.html",
+ "href": "computing/mac_arm.html",
+ "title": " MacOS ARM",
"section": "",
"text": "If you have already installed Git, LaTeX, or any of the R packages, you should be OK. However, if you have difficulty with Homework or Labs, we may ask you to uninstall and try again.\nIn order to be able to support you effectively and minimize setup issues and software conflicts, we suggest you install the required software as specified below.\nIn all the sections below, if you are presented with the choice to download either a 64-bit (also called x64) or a 32-bit (also called x86) version of the application always choose the 64-bit version."
},
{
- "objectID": "computing/windows.html#installation-notes",
- "href": "computing/windows.html#installation-notes",
- "title": " Windows",
+ "objectID": "computing/mac_arm.html#installation-notes",
+ "href": "computing/mac_arm.html#installation-notes",
+ "title": " MacOS ARM",
"section": "",
"text": "If you have already installed Git, LaTeX, or any of the R packages, you should be OK. However, if you have difficulty with Homework or Labs, we may ask you to uninstall and try again.\nIn order to be able to support you effectively and minimize setup issues and software conflicts, we suggest you install the required software as specified below.\nIn all the sections below, if you are presented with the choice to download either a 64-bit (also called x64) or a 32-bit (also called x86) version of the application always choose the 64-bit version."
},
{
- "objectID": "computing/windows.html#terminal",
- "href": "computing/windows.html#terminal",
- "title": " Windows",
+ "objectID": "computing/mac_arm.html#terminal",
+ "href": "computing/mac_arm.html#terminal",
+ "title": " MacOS ARM",
"section": "Terminal",
- "text": "Terminal\nBy “Terminal” below we mean the command line program called “Terminal”. Note that this is also available Inside RStudio. Either works."
+ "text": "Terminal\nBy “Terminal” below we mean the command line program called “Terminal”. Note that this is also available Inside RStudio. Either works. To easily pull up the Terminal (outside RStudio), Type Cmd + Space then begin typing “Terminal” and press Return."
},
{
- "objectID": "computing/windows.html#github",
- "href": "computing/windows.html#github",
- "title": " Windows",
+ "objectID": "computing/mac_arm.html#github",
+ "href": "computing/mac_arm.html#github",
+ "title": " MacOS ARM",
"section": "GitHub",
"text": "GitHub\nIn Stat 406 we will use the publicly available GitHub.com. If you do not already have an account, please sign up for one at GitHub.com\nSign up for a free account at GitHub.com if you don’t have one already."
},
{
- "objectID": "computing/windows.html#git-bash-and-windows-terminal",
- "href": "computing/windows.html#git-bash-and-windows-terminal",
- "title": " Windows",
- "section": "Git, Bash, and Windows Terminal",
- "text": "Git, Bash, and Windows Terminal\nAlthough these three are separate programs, we are including them in the same section here since they are packaged together in the same installer on Windows. Briefly, we will be using the Bash shell to interact with our computers via a command line interface, Git to keep a version history of our files and upload to/download from to GitHub, and Windows Terminal to run the both Bash and Git.\nGo to https://git-scm.com/download/win and download the windows version of git. After the download has finished, run the installer and accept the default configuration for all pages except for the following:\n\nOn the Select Components page, add a Git Bash profile to Windows Terminal.\n\n\nTo install windows terminal visit this link and click Get to open it in Windows Store. Inside the Store, click Get again and then click Install. After installation, click Launch to start Windows Terminal. In the top of the window, you will see the tab bar with one open tab, a plus sign, and a down arrow. Click the down arrow and select Settings (or type the shortcut Ctrl + ,). In the Startup section, click the dropdown menu under Default profile and select Git Bash.\n\nYou can now launch the Windows terminal from the start menu or pin it to the taskbar like any other program (you can read the rest of the article linked above for additional tips if you wish). To make sure everything worked, close down Windows Terminal, and open it again. Git Bash should open by default, the text should be green and purple, and the tab should read MINGW64:/c/Users/$USERNAME (you should also see /c/Users/$USERNAME if you type pwd into the terminal). This screenshot shows what it should look like:\n\n\n\n\n\n\n\nNote\n\n\n\nWhenever we refer to “the terminal” in these installation instructions, we want you to use the Windows Terminal that you just installed with the Git Bash profile. Do not use Windows PowerShell, CMD, or anything else unless explicitly instructed to do so.\n\n\nTo open a new tab you can click the plus sign or use Ctrl + Shift + t (you can close a tab with Ctrl + Shift + w). To copy text from the terminal, you can highlight it with the mouse and then click Ctrl + Shift + c. To paste text you use Ctrl + Shift + v, try it by pasting the following into the terminal to check which version of Bash you just installed:\nbash --version\nThe output should look similar to this:\nGNU bash, version 4.4.23(1)-release (x86_64-pc-sys)\nCopyright (C) 2019 Free Software Foundation, Inc.\nLicense GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>\nThis is free software; you are free to change and redistribute it.\nThere is NO WARRANTY, to the extent permitted by law.\n\n\n\n\n\n\nNote\n\n\n\nIf there is a newline (the enter character) in the clipboard when you are pasting into the terminal, you will be asked if you are sure you want to paste since this newline will act as if you pressed enter and run the command. As a guideline you can press Paste anyway unless you are sure you don’t want this to happen.\n\n\nLet’s also check which version of git was installed:\ngit --version\ngit version 2.32.0.windows.2\n\n\n\n\n\n\nNote\n\n\n\nSome of the Git commands we will use are only available since Git 2.23, so make sure your if your Git is at least this version.\n\n\n\nConfiguring Git user info\nNext, we need to configure Git by telling it your name and email. 
To do this, type the following into the terminal (replacing Jane Doe and janedoe@example.com, with your name and email that you used to sign up for GitHub, respectively):\ngit config --global user.name \"Jane Doe\"\ngit config --global user.email janedoe@example.com\n\n\n\n\n\n\nNote\n\n\n\nTo ensure that you haven’t made a typo in any of the above, you can view your global Git configurations by either opening the configuration file in a text editor (e.g. via the command nano ~/.gitconfig) or by typing git config --list --global).\n\n\nIf you have never used Git before, we recommend also setting the default editor:\ngit config --global core.editor nano\nIf you prefer VScode (and know how to set it up) or something else, feel free."
+ "objectID": "computing/mac_arm.html#git",
+ "href": "computing/mac_arm.html#git",
+ "title": " MacOS ARM",
+ "section": "Git",
+    "text": "Git\nWe will be using the command line version of Git as well as Git through RStudio. Some of the Git commands we will use are only available since Git 2.23, so if your Git is older than this version, we ask you to update it using the Xcode command line tools (not all of Xcode), which includes Git.\nOpen Terminal and type the following command to install Xcode command line tools:\nxcode-select --install\nAfter installation, type the following in the terminal to check the version:\ngit --version\nYou should see something like this (it does not have to be the exact same version) if you were successful:\ngit version 2.32.1 (Apple Git-133)\n\n\n\n\n\n\nNote\n\n\n\nIf you run into trouble, please see the Install Git Mac OS section from Happy Git and GitHub for the useR for additional help or strategies for Git installation.\n\n\n\nConfiguring Git user info\nNext, we need to configure Git by telling it your name and email. To do this, type the following into the terminal (replacing Jane Doe and janedoe@example.com with your name and email that you used to sign up for GitHub, respectively):\ngit config --global user.name \"Jane Doe\"\ngit config --global user.email janedoe@example.com\n\n\n\n\n\n\nNote\n\n\n\nTo ensure that you haven’t made a typo in any of the above, you can view your global Git configurations by either opening the configuration file in a text editor (e.g. via the command nano ~/.gitconfig) or by typing git config --list --global.\n\n\nIf you have never used Git before, we recommend also setting the default editor:\ngit config --global core.editor nano\nIf you prefer VScode (and know how to set it up) or something else, feel free."
},
{
- "objectID": "computing/windows.html#latex",
- "href": "computing/windows.html#latex",
- "title": " Windows",
+ "objectID": "computing/mac_arm.html#latex",
+ "href": "computing/mac_arm.html#latex",
+ "title": " MacOS ARM",
"section": "LaTeX",
- "text": "LaTeX\nIt is possible you already have this installed.\nFirst try the following check in RStudio\nStat406::test_latex_installation()\nIf you see Green checkmarks, then you’re good.\nEven if it fails, follow the instructions, and try it again.\nNote that you might see two error messages regarding lua during the installation, you can safely ignore these, the installation will complete successfully after clicking “OK”.\nIf it still fails, proceed with the instructions\n\nIn RStudio, run the following commands to install the tinytex package and setup tinytex:\ninstall.packages('tinytex')\ntinytex::install_tinytex()\nIn order for Git Bash to be able to find the location of TinyTex, you will need to sign out of Windows and back in again. After doing that, you can check that the installation worked by opening a terminal and asking for the version of latex:\nlatex --version\nYou should see something like this if you were successful:\npdfTeX 3.141592653-2.6-1.40.23 (TeX Live 2021/W32TeX)\nkpathsea version 6.3.3\nCopyright 2021 Han The Thanh (pdfTeX) et al.\nThere is NO warranty. Redistribution of this software is\ncovered by the terms of both the pdfTeX copyright and\nthe Lesser GNU General Public License.\nFor more information about these matters, see the file\nnamed COPYING and the pdfTeX source.\nPrimary author of pdfTeX: Han The Thanh (pdfTeX) et al.\nCompiled with libpng 1.6.37; using libpng 1.6.37\nCompiled with zlib 1.2.11; using zlib 1.2.11\nCompiled with xpdf version 4.03"
+    "text": "LaTeX\nIt is possible you already have this installed.\nFirst try the following check in RStudio:\nStat406::test_latex_installation()\nIf you see green checkmarks, then you’re good.\nEven if it fails, follow the instructions, and try it again.\nIf it still fails, proceed with the instructions\n\nWe will install the lightest possible version of LaTeX and its necessary packages so that we can render Jupyter notebooks and R Markdown documents to html and PDF. If you have previously installed LaTeX, please uninstall it before proceeding with these instructions.\nFirst, run the following command to make sure that /usr/local/bin is writable:\nsudo chown -R $(whoami):admin /usr/local/bin\n\n\n\n\n\n\nNote\n\n\n\nYou might be asked to enter your password during installation.\n\n\nNow open RStudio and run the following command (from the tinytex package) to set up TinyTeX:\ntinytex::install_tinytex()\nYou can check that the installation is working by opening a terminal and asking for the version of latex:\nlatex --version\nYou should see something like this if you were successful:\npdfTeX 3.141592653-2.6-1.40.23 (TeX Live 2022/dev)\nkpathsea version 6.3.4/dev\nCopyright 2021 Han The Thanh (pdfTeX) et al.\nThere is NO warranty. Redistribution of this software is\ncovered by the terms of both the pdfTeX copyright and\nthe Lesser GNU General Public License.\nFor more information about these matters, see the file\nnamed COPYING and the pdfTeX source.\nPrimary author of pdfTeX: Han The Thanh (pdfTeX) et al.\nCompiled with libpng 1.6.37; using libpng 1.6.37\nCompiled with zlib 1.2.11; using zlib 1.2.11\nCompiled with xpdf version 4.03"
},
{
- "objectID": "computing/windows.html#github-pat",
- "href": "computing/windows.html#github-pat",
- "title": " Windows",
+ "objectID": "computing/mac_arm.html#github-pat",
+ "href": "computing/mac_arm.html#github-pat",
+ "title": " MacOS ARM",
"section": "Github PAT",
"text": "Github PAT\nYou’re probably familiar with 2-factor authentication for your UBC account or other accounts which is a very secure way to protect sensitive information (in case your password gets exposed). Github uses a Personal Access Token (PAT) for the Command Line Interface (CLI) and RStudio. This is different from the password you use to log in with a web browser. You will have to create one. There are some nice R functions that will help you along, and I find that easiest.\nComplete instructions are in Chapter 9 of Happy Git With R. Here’s the quick version (you need the usethis and gitcreds libraries, which you can install with install.packages(c(\"usethis\", \"gitcreds\"))):\n\nIn the RStudio Console, call usethis::create_github_token() This should open a webbrowser. In the Note field, write what you like, perhaps “Stat 406 token”. Then update the Expiration to any date after December 15. (“No expiration” is fine, though not very secure). Make sure that everything in repo is checked. Leave all other checks as is. Scroll to the bottom and click the green “Generate Token” button.\nThis should now give you a long string to Copy. It often looks like ghp_0asfjhlasdfhlkasjdfhlksajdhf9234u. Copy that. (You would use this instead of the browser password in RStudio when it asks for a password).\nTo store the PAT permanently in R (so you’ll never have to do this again, hopefully) call gitcreds::gitcreds_set() and paste the thing you copied there."
},
{
- "objectID": "computing/windows.html#post-installation-notes",
- "href": "computing/windows.html#post-installation-notes",
- "title": " Windows",
+ "objectID": "computing/mac_arm.html#post-installation-notes",
+ "href": "computing/mac_arm.html#post-installation-notes",
+ "title": " MacOS ARM",
"section": "Post-installation notes",
"text": "Post-installation notes\nYou have completed the installation instructions, well done 🙌!"
},
{
- "objectID": "computing/windows.html#attributions",
- "href": "computing/windows.html#attributions",
- "title": " Windows",
+ "objectID": "computing/mac_arm.html#attributions",
+ "href": "computing/mac_arm.html#attributions",
+ "title": " MacOS ARM",
"section": "Attributions",
"text": "Attributions\nThe DSCI 310 Teaching Team, notably, Anmol Jawandha, Tomas Beuzen, Rodolfo Lourenzutti, Joel Ostblom, Arman Seyed-Ahmadi, Florencia D’Andrea, and Tiffany Timbers."
},
+ {
+ "objectID": "faq.html",
+ "href": "faq.html",
+ "title": " Frequently asked questions",
+ "section": "",
+ "text": "Complete readings before the material is covered in class, and then review again afterwards.\nParticipate actively in class. If you don’t understand something, I can guarantee no one else does either. I have a Ph.D., and I’ve been doing this for more than 10 years. It’s hard for me to remember what it’s like to be you and what you don’t know. Say something! I want you to learn this stuff, and I love to explain more carefully.\nCome to office hours. Again, I like explaining things.\nTry the Labs again without the help of your classmates.\nRead the examples at the end of the [ISLR] chapters. Try the exercises.\nDo not procrastinate — don’t let a module go by with unanswered questions as it will just make the following module’s material even more difficult to follow.\nDo the Worksheets."
+ },
+ {
+ "objectID": "faq.html#how-do-i-succeed-in-this-class",
+ "href": "faq.html#how-do-i-succeed-in-this-class",
+ "title": " Frequently asked questions",
+ "section": "",
+ "text": "Complete readings before the material is covered in class, and then review again afterwards.\nParticipate actively in class. If you don’t understand something, I can guarantee no one else does either. I have a Ph.D., and I’ve been doing this for more than 10 years. It’s hard for me to remember what it’s like to be you and what you don’t know. Say something! I want you to learn this stuff, and I love to explain more carefully.\nCome to office hours. Again, I like explaining things.\nTry the Labs again without the help of your classmates.\nRead the examples at the end of the [ISLR] chapters. Try the exercises.\nDo not procrastinate — don’t let a module go by with unanswered questions as it will just make the following module’s material even more difficult to follow.\nDo the Worksheets."
+ },
+ {
+ "objectID": "faq.html#git-and-github",
+ "href": "faq.html#git-and-github",
+ "title": " Frequently asked questions",
+ "section": "Git and Github",
+    "text": "Git and Github\n\nHomework/Labs workflow\nRstudio version (uses the Git tab. Usually near Environment/History in the upper right)\n\nMake sure you are on main. Pull in remote changes. Click .\nCreate a new branch by clicking the thing that looks kinda like .\nWork on your documents and save frequently.\nStage your changes by clicking the check boxes.\nCommit your changes by clicking Commit.\nRepeat 3-5 as necessary.\nPush to Github \nWhen done, go to Github and open a PR.\nUse the dropdown menu to go back to main and avoid future headaches.\n\nCommand line version\n\n(Optional, but useful. Pull in any remote changes.) git pull\nCreate a new branch git checkout -b <name-of-branch>\nWork on your documents and save frequently.\nStage your changes git add <name-of-document1> repeat for each changed document. git add . stages all changed documents.\nCommit your changes git commit -m \"some message that is meaningful\"\nRepeat 3-5 as necessary.\nPush to Github git push. It may suggest a longer form of this command, obey.\nWhen done, go to Github and open a PR.\nSwitch back to main to avoid future headaches. git checkout main.\n\n\n\nAsking for a HW regrade.\n\n\n\n\n\n\nTo be eligible\n\n\n\n\nYou must have received >3 points of deductions to be eligible.\nAnd they must have been for “content”, not penalties.\nIf you fix the errors, you can raise your grade to 7/10.\nYou must make revisions and re-request review within 1 week of your initial review.\n\n\n\n\nGo to your local branch for this HW. If you don’t remember the right name, you can check the PRs in your repo on GitHub by clicking the “Pull Requests” tab. It might be closed.\nMake any changes you need to make to the files, commit and push. Make sure to rerender the .pdf if needed.\nGo to GitHub.com and find the original PR for this assignment. There should now be additional commits since the previous Review.\nAdd a comment to the TA describing the changes you’ve made. Be concise and clear.\nUnder “Reviewers” on the upper right of the screen, you should see a 🔁 button. Once you click that, the TA will be notified to review your changes.\n\n\n\nFixing common problems\n\nmaster/main\n“master” has some pretty painful connotations. So as part of an effort to remove racist names from code, the default branch is now “main” on new versions of GitHub. But old versions (like the UBC version) still have “master”. Below, I’ll use “main”, but if you see “master” on what you’re doing, that’s the one to use.\n\n\nStart from main\nBranches should be created from the main branch, not the one you used for the last assignment.\ngit checkout main\nThis switches to main. Then pull and start the new assignment following the workflow above. (In Rstudio, use the dropdown menu.)\n\n\nYou forgot to work on a new branch\nUgh, you did some labs before realizing you forgot to create a new branch. Don’t stress. There are some things below to try. But if you’re confused ASK. We’ve had practice with this, and soon you will too!\n(1) If you started from main and haven’t made any commits (but you SAVED!!):\ngit checkout -b <new-branch-name>\nThis keeps everything you have and puts you on a new branch. No problem. Commit and proceed as usual.\n(2) If you are on main and made some commits:\ngit branch <new-branch-name>\ngit log\nThe first line makes a new branch with all the stuff you’ve done. Then we look at the log. Locate the most recent commit before you started working. It’s a long string like ac2a8365ce0fa220c11e658c98212020fa2ba7d1. 
Then,\ngit reset ac2a8 --hard\nThis rolls main back to that commit. You don’t need the whole string, just the first few characters. Finally\ngit checkout <new-branch-name>\nand continue working.\n(3) If you started work on <some-old-branch> for work you already submitted:\nThis one is harder, and I would suggest getting in touch with the TAs. Here’s the procedure.\ngit commit -am \"uhoh, I need to be on a different branch\"\ngit branch <new-branch-name>\nCommit your work with a dumb message, then create a new branch. It’s got all your stuff.\ngit log\nLocate the most recent commit before you started working. It’s a long string like ac2a8365ce0fa220c11e658c98212020fa2ba7d1. Then,\ngit rebase --onto main ac2a8 <new-branch-name>\ngit checkout <new-branch-name>\nThis makes the new branch look like main but without the differences from main that are on ac2a8 and WITH all the work you did after ac2a8. It’s pretty cool. And should work. Finally, we switch to our new branch.\n\n\n\nHow can I get better at R?\nI get this question a lot. The answer is almost never “go read the book How to learn R fast” or “watch the video on FreeRadvice.com”. To learn programming, the only thing to do is to program. Do your tutorials. Redo your tutorials. Run through the code in the textbook. Ask yourself why we used one function instead of another. Ask questions. Play little coding games. If you find yourself wondering how some bit of code works, run through it step by step. Print out the results and see what it’s doing. If you take on these kinds of tasks regularly, you will improve rapidly.\nCoding is an active skill, just like learning Spanish. You have to practice constantly. For the same reasons that it is difficult/impossible to learn Spanish just from reading a textbook, it is difficult/impossible to learn R just from reading/watching.\nWhen I took German in 7th grade, I remember my teacher saying “to learn a language, you have to constantly tell lies”. What he meant was, you don’t just say “yesterday I went to the gym”. You say “yesterday I went to the market”, “yesterday I went to the movies”, “today she’s going to the gym”, etc. The point is to internalize conjugation, vocabulary, and the inner workings of the language. The same is true when coding. Do things different ways. Try automating regular tasks.\nRecommended resources\n\nData Science: A first introduction This is the course textbook for UBC’s DSCI 100\nR4DS written by Hadley Wickham and Garrett Grolemund\nDSCI 310 Coursenotes by Tiffany A. Timbers, Joel Ostblom, Florencia D’Andrea, and Rodolfo Lourenzutti\nHappy Git with R by Jenny Bryan\nModern Dive: Statistical Inference via Data Science\nStat545\nGoogle\n\n\n\nMy code doesn’t run. What do I do?\nThis is a constant issue with code, and it happens to everyone. The following is a general workflow for debugging stuck code.\n\nIf the code is running, but not doing what you want, see below.\nRead the Error message. It will give you some important hints. Sometimes these are hard to parse, but that’s ok.\n\n\nset.seed(12345)\ny <- rnorm(10)\nx <- matrix(rnorm(20), 2)\nlinmod <- lm(y ~ x)\n## Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE): variable lengths differ (found for 'x')\n\nThis one is a little difficult. The first stuff before the colon is telling me where the error happened, but I didn’t use a function called model.frame.default. Nonetheless, after the colon it says variable lengths differ. Well y is length 10 and x has 10 rows right? 
Oh wait, how many rows does x have?\n\nRead the documentation for the function in the error message. For the above, I should try ?matrix.\nGoogle!! If the first few steps didn’t help, copy the error message into Google. This almost always helps. Best to remove any overly specific information first.\nAsk your classmates on Slack. In order to ask most effectively, you should probably provide them some idea of how the error happened. See the section on MWEs for how to do this.\nSee me or the TA. Note that it is highly likely that I will ask if you did the above steps first. And I will want to see your minimal working example (MWE).\n\n\n\n\n\n\nWarning\n\n\n\nIf you meet with me, be prepared to show me your code! Or message me your MWE. Or both. But not neither.\n\n\nIf the error cannot be reproduced in my presence, it is very unlikely that I can fix it.\n\n\nMinimal working examples\nAn MWE is a small bit of code which will work on anyone’s machine and reproduce the error that you are getting. This is a key component of getting help debugging. When you do your homework, there’s lots of stuff going on that will differ from most other students. To allow them (or me, or the TA) to help you, you need to be able to get their machine to reproduce your error (and only your error) without much hassle.\nI find that, in the process of preparing an MWE, I can often answer my own question. So it is a useful exercise even if you aren’t ready to call in the experts yet. The process of stripping your problem down to its bare essence often reveals where the root issue lies. My above code is an MWE: I set a seed, so we both can use exactly the same data, and it’s only a few lines long without calling any custom code that you don’t have.\nFor a good discussion of how to do this, see the R Lecture or stackexchange.\n\n\nHow to write good code\nThis is covered in much greater detail in the lectures, so see there. Here is my basic advice.\n\nWrite script files (which you save) and source them. Don’t do everything in the console. R (and python and Matlab and SAS) is much better as a scripting language than as a calculator.\nDon’t write anything more than once. This has three corollaries:\n\nIf you are tempted to copy/paste, don’t.\nDon’t use magic numbers. Define all constants at the top of the script.\nWrite functions.\n\nThe third is very important. Functions are easy to test. You give different inputs and check whether the output is as expected. This helps catch mistakes.\nThere are two kinds of errors: syntax and function.\n\nThe first R can find (missing close parenthesis, wrong arguments, etc.)\n\nThe second you can only catch by thorough testing\n\nDon’t use magic numbers.\nUse meaningful names. Don’t do this:\n\ndata(\"ChickWeight\")\nout <- lm(weight ~ Time + Chick + Diet, data = ChickWeight)\n\nComment things that aren’t clear from the (meaningful) names.\nComment long formulas that don’t immediately make sense:\n\ngarbage <- with(\n ChickWeight, \n by(weight, Chick, function(x) (x^2 + 23) / length(x))\n) ## WTF???"
+ },
+ {
+ "objectID": "index.html",
+ "href": "index.html",
+ "title": "Stat 406",
+ "section": "",
+    "text": "Jump to Schedule Syllabus\n\n\nAt the end of the course, you will be able to:\n\nAssess the prediction properties of the supervised learning methods covered in class;\nCorrectly use regularization to improve predictions from linear models, and also to identify important explanatory variables;\nExplain the practical difference between predictions obtained with parametric and non-parametric methods, and decide in specific applications which approach should be used;\nSelect and construct appropriate ensembles to obtain improved predictions in different contexts;\nUse and interpret principal components and other dimension reduction techniques;\nEmploy reasonable coding practices and understand basic R syntax and functions;\nWrite reports and use proper version control; engage with standard software."
+ },
+ {
+ "objectID": "computing/index.html",
+ "href": "computing/index.html",
+ "title": " Computing",
+ "section": "",
+ "text": "In order to participate in this class, we will require the use of R, and encourage the use of RStudio. Both are free, and you likely already have both.\nYou also need Git, Github and Slack.\nBelow are instructions for installation. These are edited and simplified from the DSCI 310 Setup Instructions. If you took DSCI 310 last year, you may be good to go, with the exception of the R package."
+ },
+ {
+ "objectID": "computing/index.html#laptop-requirements",
+ "href": "computing/index.html#laptop-requirements",
+ "title": " Computing",
+ "section": "Laptop requirements",
+ "text": "Laptop requirements\n\nRuns one of the following operating systems: Ubuntu 20.04, macOS (version 11.4.x or higher), Windows 10 (version 2004, 20H2, 21H1 or higher).\n\nWhen installing Ubuntu, checking the box “Install third party…” will (among other things) install proprietary drivers, which can be helpful for wifi and graphics cards.\n\nCan connect to networks via a wireless connection for on campus work\nHas at least 30 GB disk space available\nHas at least 4 GB of RAM\nUses a 64-bit CPU\nIs at most 6 years old (4 years old or newer is recommended)\nUses English as the default language. Using other languages is possible, but we have found that it often causes problems in the homework. We’ve done our best to fix them, but we may ask you to change it if you are having trouble.\nStudent user has full administrative access to the computer."
+ },
+ {
+ "objectID": "computing/index.html#software-installation-instructions",
+ "href": "computing/index.html#software-installation-instructions",
+ "title": " Computing",
+ "section": "Software installation instructions",
+ "text": "Software installation instructions\nPlease click the appropriate link below to view the installation instructions for your operating system:\n\nmacOS x86 or macOS arm\nUbuntu\nWindows"
+ },
{
"objectID": "computing/mac_x86.html",
"href": "computing/mac_x86.html",
@@ -1883,2131 +2401,2033 @@
"text": "Attributions\nThe DSCI 310 Teaching Team, notably, Anmol Jawandha, Tomas Beuzen, Rodolfo Lourenzutti, Joel Ostblom, Arman Seyed-Ahmadi, Florencia D’Andrea, and Tiffany Timbers."
},
{
- "objectID": "computing/index.html",
- "href": "computing/index.html",
- "title": " Computing",
- "section": "",
- "text": "In order to participate in this class, we will require the use of R, and encourage the use of RStudio. Both are free, and you likely already have both.\nYou also need Git, Github and Slack.\nBelow are instructions for installation. These are edited and simplified from the DSCI 310 Setup Instructions. If you took DSCI 310 last year, you may be good to go, with the exception of the R package."
- },
- {
- "objectID": "computing/index.html#laptop-requirements",
- "href": "computing/index.html#laptop-requirements",
- "title": " Computing",
- "section": "Laptop requirements",
- "text": "Laptop requirements\n\nRuns one of the following operating systems: Ubuntu 20.04, macOS (version 11.4.x or higher), Windows 10 (version 2004, 20H2, 21H1 or higher).\n\nWhen installing Ubuntu, checking the box “Install third party…” will (among other things) install proprietary drivers, which can be helpful for wifi and graphics cards.\n\nCan connect to networks via a wireless connection for on campus work\nHas at least 30 GB disk space available\nHas at least 4 GB of RAM\nUses a 64-bit CPU\nIs at most 6 years old (4 years old or newer is recommended)\nUses English as the default language. Using other languages is possible, but we have found that it often causes problems in the homework. We’ve done our best to fix them, but we may ask you to change it if you are having trouble.\nStudent user has full administrative access to the computer."
- },
- {
- "objectID": "computing/index.html#software-installation-instructions",
- "href": "computing/index.html#software-installation-instructions",
- "title": " Computing",
- "section": "Software installation instructions",
- "text": "Software installation instructions\nPlease click the appropriate link below to view the installation instructions for your operating system:\n\nmacOS x86 or macOS arm\nUbuntu\nWindows"
- },
- {
- "objectID": "index.html",
- "href": "index.html",
- "title": "Stat 406",
- "section": "",
- "text": "Jump to Schedule Syllabus\n\n\nAt the end of the course, you will be able to:\n\nAssess the prediction properties of the supervised learning methods covered in class;\nCorrectly use regularization to improve predictions from linear models, and also to identify important explanatory variables;\nExplain the practical difference between predictions obtained with parametric and non-parametric methods, and decide in specific applications which approach should be used;\nSelect and construct appropriate ensembles to obtain improved predictions in different contexts;\nUse and interpret principal components and other dimension reduction techniques;\nEmploy reasonable coding practices and understand basic R syntax and function.\nWrite reports and use proper version control; engage with standard software."
- },
- {
- "objectID": "faq.html",
- "href": "faq.html",
- "title": " Frequently asked questions",
- "section": "",
- "text": "Complete readings before the material is covered in class, and then review again afterwards.\nParticipate actively in class. If you don’t understand something, I can guarantee no one else does either. I have a Ph.D., and I’ve been doing this for more than 10 years. It’s hard for me to remember what it’s like to be you and what you don’t know. Say something! I want you to learn this stuff, and I love to explain more carefully.\nCome to office hours. Again, I like explaining things.\nTry the Labs again without the help of your classmates.\nRead the examples at the end of the [ISLR] chapters. Try the exercises.\nDo not procrastinate — don’t let a module go by with unanswered questions as it will just make the following module’s material even more difficult to follow.\nDo the Worksheets."
- },
- {
- "objectID": "faq.html#how-do-i-succeed-in-this-class",
- "href": "faq.html#how-do-i-succeed-in-this-class",
- "title": " Frequently asked questions",
- "section": "",
- "text": "Complete readings before the material is covered in class, and then review again afterwards.\nParticipate actively in class. If you don’t understand something, I can guarantee no one else does either. I have a Ph.D., and I’ve been doing this for more than 10 years. It’s hard for me to remember what it’s like to be you and what you don’t know. Say something! I want you to learn this stuff, and I love to explain more carefully.\nCome to office hours. Again, I like explaining things.\nTry the Labs again without the help of your classmates.\nRead the examples at the end of the [ISLR] chapters. Try the exercises.\nDo not procrastinate — don’t let a module go by with unanswered questions as it will just make the following module’s material even more difficult to follow.\nDo the Worksheets."
- },
- {
- "objectID": "faq.html#git-and-github",
- "href": "faq.html#git-and-github",
- "title": " Frequently asked questions",
- "section": "Git and Github",
- "text": "Git and Github\n\nHomework/Labs workflow\nRstudio version (uses the Git tab. Usually near Environment/History in the upper right)\n\nMake sure you are on main. Pull in remote changes. Click .\nCreate a new branch by clicking the think that looks kinda like .\nWork on your documents and save frequently.\nStage your changes by clicking the check boxes.\nCommit your changes by clicking Commit.\nRepeat 3-5 as necessary.\nPush to Github \nWhen done, go to Github and open a PR.\nUse the dropdown menu to go back to main and avoid future headaches.\n\nCommand line version\n\n(Optional, but useful. Pull in any remote changes.) git pull\nCreate a new branch git branch -b <name-of-branch>\nWork on your documents and save frequently.\nStage your changes git add <name-of-document1> repeat for each changed document. git add . stages all changed documents.\nCommit your changes git commit -m \"some message that is meaningful\"\nRepeat 3-5 as necessary.\nPush to Github git push. It may suggest a longer form of this command, obey.\nWhen done, go to Github and open a PR.\nSwitch back to main to avoid future headaches. git checkout main.\n\n\n\nAsking for a HW regrade.\n\n\n\n\n\n\nTo be eligible\n\n\n\n\nYou must have received >3 points of deductions to be eligible.\nAnd they must have been for “content”, not penalties.\nIf you fix the errors, you can raise your grade to 7/10.\nYou must make revisions and re-request review within 1 week of your initial review.\n\n\n\n\nGo to the your local branch for this HW. If you don’t remember the right name, you can check the PRs in your repo on GitHub by clicking “Pull Requests” tab. It might be closed.\nMake any changes you need to make to the files, commit and push. Make sure to rerender the .pdf if needed.\nGo to GitHub.com and find the original PR for this assignment. There should now be additional commits since the previous Review.\nAdd a comment to the TA describing the changes you’ve made. Be concise and clear.\nUnder “Reviewers” on the upper right of the screen, you should see a 🔁 button. Once you click that, the TA will be notified to review your changes.\n\n\n\nFixing common problems\n\nmaster/main\n“master” has some pretty painful connotations. So as part of an effort to remove racist names from code, the default branch is now “main” on new versions of GitHub. But old versions (like the UBC version) still have “master”. Below, I’ll use “main”, but if you see “master” on what you’re doing, that’s the one to use.\n\n\nStart from main\nBranches should be created from the main branch, not the one you used for the last assignment.\ngit checkout main\nThis switches to main. Then pull and start the new assignment following the workflow above. (In Rstudio, use the dropdown menu.)\n\n\nYou forgot to work on a new branch\nUgh, you did some labs before realizing you forgot to create a new branch. Don’t stress. There are some things below to try. But if you’re confused ASK. We’ve had practice with this, and soon you will too!\n(1) If you started from main and haven’t made any commits (but you SAVED!!):\ngit branch -b <new-branch-name>\nThis keeps everything you have and puts you on a new branch. No problem. Commit and proceed as usual.\n(2) If you are on main and made some commits:\ngit branch <new-branch-name>\ngit log\nThe first line makes a new branch with all the stuff you’ve done. Then we look at the log. Locate the most recent commit before you started working. It’s a long string like ac2a8365ce0fa220c11e658c98212020fa2ba7d1. 
Then,\ngit reset ac2a8 --hard\nThis rolls main back to that commit. You don’t need the whole string, just the first few characters. Finally\ngit checkout <new-branch-name>\nand continue working.\n(3) If you started work on <some-old-branch> for work you already submitted:\nThis one is harder, and I would suggest getting in touch with the TAs. Here’s the procedure.\ngit commit -am \"uhoh, I need to be on a different branch\"\ngit branch <new-branch-name>\nCommit your work with a dumb message, then create a new branch. It’s got all your stuff.\ngit log\nLocate the most recent commit before you started working. It’s a long string like ac2a8365ce0fa220c11e658c98212020fa2ba7d1. Then,\ngit rebase --onto main ac2a8 <new-branch-name>\ngit checkout <new-branch-name>\nThis makes the new branch look like main but without the differences from main that are on ac2a8 and WITH all the work you did after ac2a8. It’s pretty cool. And should work. Finally, we switch to our new branch.\n\n\n\nHow can I get better at R?\nI get this question a lot. The answer is almost never “go read the book How to learn R fast” or “watch the video on FreeRadvice.com”. To learn programming, the only thing to do is to program. Do your tutorialls. Redo your tutorials. Run through the code in the textbook. Ask yourself why we used one function instead of another. Ask questions. Play little coding games. If you find yourself wondering how some bit of code works, run through it step by step. Print out the results and see what it’s doing. If you take on these kinds of tasks regularly, you will improve rapidly.\nCoding is an active activity just like learning Spanish. You have to practice constantly. For the same reasons that it is difficult/impossible to learn Spanish just from reading a textbook, it is difficult/impossible to learn R just from reading/watching.\nWhen I took German in 7th grade, I remember my teacher saying “to learn a language, you have to constantly tell lies”. What he meant was, you don’t just say “yesterday I went to the gym”. You say “yesterday I went to the market”, “yesterday I went to the movies”, “today she’s going to the gym”, etc. The point is to internalize conjugation, vocabulary, and the inner workings of the language. The same is true when coding. Do things different ways. Try automating regular tasks.\nRecommended resources\n\nData Science: A first introduction This is the course textbook for UBC’s DSCI 100\nR4DS written by Hadley Wickham and Garrett Grolemund\nDSCI 310 Coursenotes by Tiffany A. Timbers, Joel Ostblom, Florencia D’Andrea, and Rodolfo Lourenzutti\nHappy Git with R by Jenny Bryan\nModern Dive: Statistical Inference via Data Science\nStat545\nGoogle\n\n\n\nMy code doesn’t run. What do I do?\nThis is a constant issue with code, and it happens to everyone. The following is a general workflow for debugging stuck code.\n\nIf the code is running, but not doing what you want, see below.\nRead the Error message. It will give you some important hints. Sometimes these are hard to parse, but that’s ok.\n\n\nset.seed(12345)\ny <- rnorm(10)\nx <- matrix(rnorm(20), 2)\nlinmod <- lm(y ~ x)\n## Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE): variable lengths differ (found for 'x')\n\nThis one is a little difficult. The first stuff before the colon is telling me where the error happened, but I didn’t use a function called model.frame.default. Nonetheless, after the colon it says variable lengths differ. Well y is length 10 and x has 10 rows right? 
Oh wait, how many rows does x have?\n\nRead the documentation for the function in the error message. For the above, I should try ?matrix.\nGoogle!! If the first few steps didn’t help, copy the error message into Google. This almost always helps. Best to remove any overly specific information first.\nAsk your classmates Slack. In order to ask most effectively, you should probably provide them some idea of how the error happened. See the section on MWEs for how to do this.\nSee me or the TA. Note that it is highly likely that I will ask if you did the above steps first. And I will want to see your minimal working example (MWE).\n\n\n\n\n\n\n\nWarning\n\n\n\nIf you meet with me, be prepared to show me your code! Or message me your MWE. Or both. But not neither.\n\n\nIf the error cannot be reproduced in my presence, it is very unlikely that I can fix it.\n\n\nMinimal working examples\nAn MWE is a small bit of code which will work on anyone’s machine and reproduce the error that you are getting. This is a key component of getting help debugging. When you do your homework, there’s lots of stuff going on that will differ from most other students. To allow them (or me, or the TA) to help you, you need to be able to get their machine to reproduce your error (and only your error) without much hassle.\nI find that, in the process of preparing an MWE, I can often answer my own question. So it is a useful exercise even if you aren’t ready to call in the experts yet. The process of stripping your problem down to its bare essence often reveals where the root issue lies. My above code is an MWE: I set a seed, so we both can use exactly the same data, and it’s only a few lines long without calling any custom code that you don’t have.\nFor a good discussion of how to do this, see the R Lecture or stackexchange.\n\n\nHow to write good code\nThis is covered in much greater detail in the lectures, so see there. Here is my basic advice.\n\nWrite script files (which you save) and source them. Don’t do everything in the console. R (and python and Matlab and SAS) is much better as a scripting language than as a calculator.\nDon’t write anything more than once. This has three corollaries:\n\nIf you are tempted to copy/paste, don’t.\nDon’t use magic numbers. Define all constants at the top of the script.\nWrite functions.\n\nThe third is very important. Functions are easy to test. You give different inputs and check whether the output is as expected. This helps catch mistakes.\nThere are two kinds of errors: syntax and function.\n\nThe first R can find (missing close parenthesis, wrong arguments, etc.)\n\nThe second you can only catch by thorough testing\n\nDon’t use magic numbers.\nUse meaningful names. Don’t do this:\n\ndata(\"ChickWeight\")\nout <- lm(weight ~ Time + Chick + Diet, data = ChickWeight)\n\nComment things that aren’t clear from the (meaningful) names.\nComment long formulas that don’t immediately make sense:\n\ngarbage <- with(\n ChickWeight, \n by(weight, Chick, function(x) (x^2 + 23) / length(x))\n) ## WTF???"
- },
- {
- "objectID": "computing/mac_arm.html",
- "href": "computing/mac_arm.html",
- "title": " MacOS ARM",
+ "objectID": "computing/windows.html",
+ "href": "computing/windows.html",
+ "title": " Windows",
"section": "",
"text": "If you have already installed Git, LaTeX, or any of the R packages, you should be OK. However, if you have difficulty with Homework or Labs, we may ask you to uninstall and try again.\nIn order to be able to support you effectively and minimize setup issues and software conflicts, we suggest you install the required software as specified below.\nIn all the sections below, if you are presented with the choice to download either a 64-bit (also called x64) or a 32-bit (also called x86) version of the application always choose the 64-bit version."
},
{
- "objectID": "computing/mac_arm.html#installation-notes",
- "href": "computing/mac_arm.html#installation-notes",
- "title": " MacOS ARM",
+ "objectID": "computing/windows.html#installation-notes",
+ "href": "computing/windows.html#installation-notes",
+ "title": " Windows",
"section": "",
"text": "If you have already installed Git, LaTeX, or any of the R packages, you should be OK. However, if you have difficulty with Homework or Labs, we may ask you to uninstall and try again.\nIn order to be able to support you effectively and minimize setup issues and software conflicts, we suggest you install the required software as specified below.\nIn all the sections below, if you are presented with the choice to download either a 64-bit (also called x64) or a 32-bit (also called x86) version of the application always choose the 64-bit version."
},
{
- "objectID": "computing/mac_arm.html#terminal",
- "href": "computing/mac_arm.html#terminal",
- "title": " MacOS ARM",
+ "objectID": "computing/windows.html#terminal",
+ "href": "computing/windows.html#terminal",
+ "title": " Windows",
"section": "Terminal",
- "text": "Terminal\nBy “Terminal” below we mean the command line program called “Terminal”. Note that this is also available Inside RStudio. Either works. To easily pull up the Terminal (outside RStudio), Type Cmd + Space then begin typing “Terminal” and press Return."
+    "text": "Terminal\nBy “Terminal” below we mean the command line program called “Terminal”. Note that this is also available inside RStudio. Either works."
},
{
- "objectID": "computing/mac_arm.html#github",
- "href": "computing/mac_arm.html#github",
- "title": " MacOS ARM",
+ "objectID": "computing/windows.html#github",
+ "href": "computing/windows.html#github",
+ "title": " Windows",
"section": "GitHub",
"text": "GitHub\nIn Stat 406 we will use the publicly available GitHub.com. If you do not already have an account, please sign up for one at GitHub.com\nSign up for a free account at GitHub.com if you don’t have one already."
},
{
- "objectID": "computing/mac_arm.html#git",
- "href": "computing/mac_arm.html#git",
- "title": " MacOS ARM",
- "section": "Git",
- "text": "Git\nWe will be using the command line version of Git as well as Git through RStudio. Some of the Git commands we will use are only available since Git 2.23, so if your Git is older than this version, we ask you to update it using the Xcode command line tools (not all of Xcode), which includes Git.\nOpen Terminal and type the following command to install Xcode command line tools:\nxcode-select --install\nAfter installation, in terminal type the following to ask for the version:\ngit --version\nyou should see something like this (does not have to be the exact same version) if you were successful:\ngit version 2.32.1 (Apple Git-133)\n\n\n\n\n\n\nNote\n\n\n\nIf you run into trouble, please see the Install Git Mac OS section from Happy Git and GitHub for the useR for additional help or strategies for Git installation.\n\n\n\nConfiguring Git user info\nNext, we need to configure Git by telling it your name and email. To do this, type the following into the terminal (replacing Jane Doe and janedoe@example.com, with your name and email that you used to sign up for GitHub, respectively):\ngit config --global user.name \"Jane Doe\"\ngit config --global user.email janedoe@example.com\n\n\n\n\n\n\nNote\n\n\n\nTo ensure that you haven’t made a typo in any of the above, you can view your global Git configurations by either opening the configuration file in a text editor (e.g. via the command nano ~/.gitconfig) or by typing git config --list --global).\n\n\nIf you have never used Git before, we recommend also setting the default editor:\ngit config --global core.editor nano\nIf you prefer VScode (and know how to set it up) or something else, feel free."
+ "objectID": "computing/windows.html#git-bash-and-windows-terminal",
+ "href": "computing/windows.html#git-bash-and-windows-terminal",
+ "title": " Windows",
+ "section": "Git, Bash, and Windows Terminal",
+    "text": "Git, Bash, and Windows Terminal\nAlthough these three are separate programs, we are including them in the same section here since they are packaged together in the same installer on Windows. Briefly, we will be using the Bash shell to interact with our computers via a command line interface, Git to keep a version history of our files and upload to/download from GitHub, and Windows Terminal to run both Bash and Git.\nGo to https://git-scm.com/download/win and download the Windows version of Git. After the download has finished, run the installer and accept the default configuration for all pages except for the following:\n\nOn the Select Components page, add a Git Bash profile to Windows Terminal.\n\n\nTo install Windows Terminal, visit this link and click Get to open it in the Windows Store. Inside the Store, click Get again and then click Install. After installation, click Launch to start Windows Terminal. At the top of the window, you will see the tab bar with one open tab, a plus sign, and a down arrow. Click the down arrow and select Settings (or type the shortcut Ctrl + ,). In the Startup section, click the dropdown menu under Default profile and select Git Bash.\n\nYou can now launch the Windows Terminal from the Start menu or pin it to the taskbar like any other program (you can read the rest of the article linked above for additional tips if you wish). To make sure everything worked, close down Windows Terminal, and open it again. Git Bash should open by default, the text should be green and purple, and the tab should read MINGW64:/c/Users/$USERNAME (you should also see /c/Users/$USERNAME if you type pwd into the terminal). This screenshot shows what it should look like:\n\n\n\n\n\n\n\nNote\n\n\n\nWhenever we refer to “the terminal” in these installation instructions, we want you to use the Windows Terminal that you just installed with the Git Bash profile. Do not use Windows PowerShell, CMD, or anything else unless explicitly instructed to do so.\n\n\nTo open a new tab you can click the plus sign or use Ctrl + Shift + t (you can close a tab with Ctrl + Shift + w). To copy text from the terminal, you can highlight it with the mouse and then press Ctrl + Shift + c. To paste text, press Ctrl + Shift + v. Try it by pasting the following into the terminal to check which version of Bash you just installed:\nbash --version\nThe output should look similar to this:\nGNU bash, version 4.4.23(1)-release (x86_64-pc-sys)\nCopyright (C) 2019 Free Software Foundation, Inc.\nLicense GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>\nThis is free software; you are free to change and redistribute it.\nThere is NO WARRANTY, to the extent permitted by law.\n\n\n\n\n\n\nNote\n\n\n\nIf there is a newline (the enter character) in the clipboard when you are pasting into the terminal, you will be asked if you are sure you want to paste since this newline will act as if you pressed Enter and will run the command. As a guideline you can press Paste anyway unless you are sure you don’t want this to happen.\n\n\nLet’s also check which version of Git was installed:\ngit --version\ngit version 2.32.0.windows.2\n\n\n\n\n\n\nNote\n\n\n\nSome of the Git commands we will use are only available since Git 2.23, so make sure your Git is at least this version.\n\n\n\nConfiguring Git user info\nNext, we need to configure Git by telling it your name and email. 
To do this, type the following into the terminal (replacing Jane Doe and janedoe@example.com with your name and email that you used to sign up for GitHub, respectively):\ngit config --global user.name \"Jane Doe\"\ngit config --global user.email janedoe@example.com\n\n\n\n\n\n\nNote\n\n\n\nTo ensure that you haven’t made a typo in any of the above, you can view your global Git configurations by either opening the configuration file in a text editor (e.g. via the command nano ~/.gitconfig) or by typing git config --list --global.\n\n\nIf you have never used Git before, we recommend also setting the default editor:\ngit config --global core.editor nano\nIf you prefer VScode (and know how to set it up) or something else, feel free."
},
{
- "objectID": "computing/mac_arm.html#latex",
- "href": "computing/mac_arm.html#latex",
- "title": " MacOS ARM",
+ "objectID": "computing/windows.html#latex",
+ "href": "computing/windows.html#latex",
+ "title": " Windows",
"section": "LaTeX",
- "text": "LaTeX\nIt is possible you already have this installed.\nFirst try the following check in RStudio\nStat406::test_latex_installation()\nIf you see Green checkmarks, then you’re good.\nEven if it fails, follow the instructions, and try it again.\nIf it stall fails, proceed with the instructions\n\nWe will install the lightest possible version of LaTeX and its necessary packages as possible so that we can render Jupyter notebooks and R Markdown documents to html and PDF. If you have previously installed LaTeX, please uninstall it before proceeding with these instructions.\nFirst, run the following command to make sure that /usr/local/bin is writable:\nsudo chown -R $(whoami):admin /usr/local/bin\n\n\n\n\n\n\nNote\n\n\n\nYou might be asked to enter your password during installation.\n\n\nNow open RStudio and run the following commands to install the tinytex package and setup tinytex:\ntinytex::install_tinytex()\nYou can check that the installation is working by opening a terminal and asking for the version of latex:\nlatex --version\nYou should see something like this if you were successful:\npdfTeX 3.141592653-2.6-1.40.23 (TeX Live 2022/dev)\nkpathsea version 6.3.4/dev\nCopyright 2021 Han The Thanh (pdfTeX) et al.\nThere is NO warranty. Redistribution of this software is\ncovered by the terms of both the pdfTeX copyright and\nthe Lesser GNU General Public License.\nFor more information about these matters, see the file\nnamed COPYING and the pdfTeX source.\nPrimary author of pdfTeX: Han The Thanh (pdfTeX) et al.\nCompiled with libpng 1.6.37; using libpng 1.6.37\nCompiled with zlib 1.2.11; using zlib 1.2.11\nCompiled with xpdf version 4.03"
+    "text": "LaTeX\nIt is possible you already have this installed.\nFirst try the following check in RStudio:\nStat406::test_latex_installation()\nIf you see green checkmarks, then you’re good.\nEven if it fails, follow the instructions, and try it again.\nNote that you might see two error messages regarding lua during the installation; you can safely ignore these, and the installation will complete successfully after clicking “OK”.\nIf it still fails, proceed with the instructions\n\nIn RStudio, run the following commands to install the tinytex package and set up TinyTeX:\ninstall.packages('tinytex')\ntinytex::install_tinytex()\nIn order for Git Bash to be able to find the location of TinyTeX, you will need to sign out of Windows and back in again. After doing that, you can check that the installation worked by opening a terminal and asking for the version of latex:\nlatex --version\nYou should see something like this if you were successful:\npdfTeX 3.141592653-2.6-1.40.23 (TeX Live 2021/W32TeX)\nkpathsea version 6.3.3\nCopyright 2021 Han The Thanh (pdfTeX) et al.\nThere is NO warranty. Redistribution of this software is\ncovered by the terms of both the pdfTeX copyright and\nthe Lesser GNU General Public License.\nFor more information about these matters, see the file\nnamed COPYING and the pdfTeX source.\nPrimary author of pdfTeX: Han The Thanh (pdfTeX) et al.\nCompiled with libpng 1.6.37; using libpng 1.6.37\nCompiled with zlib 1.2.11; using zlib 1.2.11\nCompiled with xpdf version 4.03"
},
{
- "objectID": "computing/mac_arm.html#github-pat",
- "href": "computing/mac_arm.html#github-pat",
- "title": " MacOS ARM",
+ "objectID": "computing/windows.html#github-pat",
+ "href": "computing/windows.html#github-pat",
+ "title": " Windows",
"section": "Github PAT",
"text": "Github PAT\nYou’re probably familiar with 2-factor authentication for your UBC account or other accounts which is a very secure way to protect sensitive information (in case your password gets exposed). Github uses a Personal Access Token (PAT) for the Command Line Interface (CLI) and RStudio. This is different from the password you use to log in with a web browser. You will have to create one. There are some nice R functions that will help you along, and I find that easiest.\nComplete instructions are in Chapter 9 of Happy Git With R. Here’s the quick version (you need the usethis and gitcreds libraries, which you can install with install.packages(c(\"usethis\", \"gitcreds\"))):\n\nIn the RStudio Console, call usethis::create_github_token() This should open a webbrowser. In the Note field, write what you like, perhaps “Stat 406 token”. Then update the Expiration to any date after December 15. (“No expiration” is fine, though not very secure). Make sure that everything in repo is checked. Leave all other checks as is. Scroll to the bottom and click the green “Generate Token” button.\nThis should now give you a long string to Copy. It often looks like ghp_0asfjhlasdfhlkasjdfhlksajdhf9234u. Copy that. (You would use this instead of the browser password in RStudio when it asks for a password).\nTo store the PAT permanently in R (so you’ll never have to do this again, hopefully) call gitcreds::gitcreds_set() and paste the thing you copied there."
},
{
- "objectID": "computing/mac_arm.html#post-installation-notes",
- "href": "computing/mac_arm.html#post-installation-notes",
- "title": " MacOS ARM",
+ "objectID": "computing/windows.html#post-installation-notes",
+ "href": "computing/windows.html#post-installation-notes",
+ "title": " Windows",
"section": "Post-installation notes",
"text": "Post-installation notes\nYou have completed the installation instructions, well done 🙌!"
},
{
- "objectID": "computing/mac_arm.html#attributions",
- "href": "computing/mac_arm.html#attributions",
- "title": " MacOS ARM",
+ "objectID": "computing/windows.html#attributions",
+ "href": "computing/windows.html#attributions",
+ "title": " Windows",
"section": "Attributions",
"text": "Attributions\nThe DSCI 310 Teaching Team, notably, Anmol Jawandha, Tomas Beuzen, Rodolfo Lourenzutti, Joel Ostblom, Arman Seyed-Ahmadi, Florencia D’Andrea, and Tiffany Timbers."
},
{
- "objectID": "computing/ubuntu.html",
- "href": "computing/ubuntu.html",
- "title": " Ubuntu",
- "section": "",
- "text": "If you have already installed Git, LaTeX, or any of the R packages, you should be OK. However, if you have difficulty with Homework or Labs, we may ask you to uninstall and try again.\nIn order to be able to support you effectively and minimize setup issues and software conflicts, we suggest you install the required software as specified below."
- },
- {
- "objectID": "computing/ubuntu.html#installation-notes",
- "href": "computing/ubuntu.html#installation-notes",
- "title": " Ubuntu",
+ "objectID": "course-setup.html",
+ "href": "course-setup.html",
+ "title": "Guide for setting up the course infrastructure",
"section": "",
- "text": "If you have already installed Git, LaTeX, or any of the R packages, you should be OK. However, if you have difficulty with Homework or Labs, we may ask you to uninstall and try again.\nIn order to be able to support you effectively and minimize setup issues and software conflicts, we suggest you install the required software as specified below."
- },
- {
- "objectID": "computing/ubuntu.html#ubuntu-software-settings",
- "href": "computing/ubuntu.html#ubuntu-software-settings",
- "title": " Ubuntu",
- "section": "Ubuntu software settings",
- "text": "Ubuntu software settings\nTo ensure that you are installing the right version of the software in this guide, open “Software & Updates” and make sure that the boxes in the screenshot are checked (this is the default configuration)."
- },
- {
- "objectID": "computing/ubuntu.html#github",
- "href": "computing/ubuntu.html#github",
- "title": " Ubuntu",
- "section": "GitHub",
- "text": "GitHub\nIn Stat 406 we will use the publicly available GitHub.com. If you do not already have an account, please sign up for one at GitHub.com\nSign up for a free account at GitHub.com if you don’t have one already."
- },
- {
- "objectID": "computing/ubuntu.html#git",
- "href": "computing/ubuntu.html#git",
- "title": " Ubuntu",
- "section": "Git",
- "text": "Git\nWe will be using the command line version of Git as well as Git through RStudio. Some of the Git commands we will use are only available since Git 2.23, so if your Git is older than this version, so if your Git is older than this version, we ask you to update it using the following commands:\nsudo apt update\nsudo apt install git\nYou can check your git version with the following command:\ngit --version\n\n\n\n\n\n\nNote\n\n\n\nIf you run into trouble, please see the Install Git Linux section from Happy Git and GitHub for the useR for additional help or strategies for Git installation.\n\n\n\nConfiguring Git user info\nNext, we need to configure Git by telling it your name and email. To do this, type the following into the terminal (replacing Jane Doe and janedoe@example.com, with your name and email that you used to sign up for GitHub, respectively):\ngit config --global user.name \"Jane Doe\"\ngit config --global user.email janedoe@example.com\n\n\n\n\n\n\nNote\n\n\n\nTo ensure that you haven’t made a typo in any of the above, you can view your global Git configurations by either opening the configuration file in a text editor (e.g. via the command nano ~/.gitconfig) or by typing git config --list --global).\n\n\nIf you have never used Git before, we recommend also setting the default editor:\ngit config --global core.editor nano\nIf you prefer VScode (and know how to set it up) or something else, feel free."
- },
- {
- "objectID": "computing/ubuntu.html#latex",
- "href": "computing/ubuntu.html#latex",
- "title": " Ubuntu",
- "section": "LaTeX",
- "text": "LaTeX\nIt is possible you already have this installed.\nFirst try the following check in RStudio\nStat406::test_latex_installation()\nIf you see Green checkmarks, then you’re good.\nEven if it fails, follow the instructions, and try it again.\nIf it still fails, proceed with the instructions\n\nWe will install the lightest possible version of LaTeX and its necessary packages as possible so that we can render Jupyter notebooks and R Markdown documents to html and PDF. If you have previously installed LaTeX, please uninstall it before proceeding with these instructions.\nFirst, run the following command to make sure that /usr/local/bin is writable:\nsudo chown -R $(whoami):admin /usr/local/bin\n\n\n\n\n\n\nNote\n\n\n\nYou might be asked to enter your password during installation.\n\n\nNow open RStudio and run the following commands to install the tinytex package and setup tinytex:\ntinytex::install_tinytex()\nYou can check that the installation is working by opening a terminal and asking for the version of latex:\nlatex --version\nYou should see something like this if you were successful:\npdfTeX 3.141592653-2.6-1.40.23 (TeX Live 2022/dev)\nkpathsea version 6.3.4/dev\nCopyright 2021 Han The Thanh (pdfTeX) et al.\nThere is NO warranty. Redistribution of this software is\ncovered by the terms of both the pdfTeX copyright and\nthe Lesser GNU General Public License.\nFor more information about these matters, see the file\nnamed COPYING and the pdfTeX source.\nPrimary author of pdfTeX: Han The Thanh (pdfTeX) et al.\nCompiled with libpng 1.6.37; using libpng 1.6.37\nCompiled with zlib 1.2.11; using zlib 1.2.11\nCompiled with xpdf version 4.03"
- },
- {
- "objectID": "computing/ubuntu.html#github-pat",
- "href": "computing/ubuntu.html#github-pat",
- "title": " Ubuntu",
- "section": "Github PAT",
- "text": "Github PAT\nYou’re probably familiar with 2-factor authentication for your UBC account or other accounts which is a very secure way to protect sensitive information (in case your password gets exposed). Github uses a Personal Access Token (PAT) for the Command Line Interface (CLI) and RStudio. This is different from the password you use to log in with a web browser. You will have to create one. There are some nice R functions that will help you along, and I find that easiest.\nComplete instructions are in Chapter 9 of Happy Git With R. Here’s the quick version (you need the usethis and gitcreds libraries, which you can install with install.packages(c(\"usethis\", \"gitcreds\"))):\n\nIn the RStudio Console, call usethis::create_github_token() This should open a webbrowser. In the Note field, write what you like, perhaps “Stat 406 token”. Then update the Expiration to any date after December 15. (“No expiration” is fine, though not very secure). Make sure that everything in repo is checked. Leave all other checks as is. Scroll to the bottom and click the green “Generate Token” button.\nThis should now give you a long string to Copy. It often looks like ghp_0asfjhlasdfhlkasjdfhlksajdhf9234u. Copy that. (You would use this instead of the browser password in RStudio when it asks for a password).\nTo store the PAT permanently in R (so you’ll never have to do this again, hopefully) call gitcreds::gitcreds_set() and paste the thing you copied there."
- },
- {
- "objectID": "computing/ubuntu.html#post-installation-notes",
- "href": "computing/ubuntu.html#post-installation-notes",
- "title": " Ubuntu",
- "section": "Post-installation notes",
- "text": "Post-installation notes\nYou have completed the installation instructions, well done 🙌!"
- },
- {
- "objectID": "computing/ubuntu.html#attributions",
- "href": "computing/ubuntu.html#attributions",
- "title": " Ubuntu",
- "section": "Attributions",
- "text": "Attributions\nThe DSCI 310 Teaching Team, notably, Anmol Jawandha, Tomas Beuzen, Rodolfo Lourenzutti, Joel Ostblom, Arman Seyed-Ahmadi, Florencia D’Andrea, and Tiffany Timbers."
+ "text": "Version 2023\nThis guide (hopefully) gives enough instructions for recreating new iterations of Stat 406."
},
{
- "objectID": "syllabus.html",
- "href": "syllabus.html",
- "title": " Syllabus",
- "section": "",
- "text": "Term 2023 Winter 1: 05 Sep - 07 Dec 2023"
+ "objectID": "course-setup.html#github-org",
+ "href": "course-setup.html#github-org",
+ "title": "Guide for setting up the course infrastructure",
+ "section": "Github Org",
+ "text": "Github Org\n\nCreate a GitHub.com organization\n\nThis is free for faculty with instructor credentials.\nAllows more comprehensive GitHub actions, PR templates and CODEOWNER behaviour than the UBC Enterprise version\nDownside is getting students added (though we include R scripts for this)\n\nOnce done, go to https://github.com/watching. Click the Red Down arrow “Unwatch all”. Then select this Org. The TAs should do the same.\n\n\nPermissions and structure\nSettings > Member Privileges\nWe list only the important ones.\n\nBase Permissions: No Permission\nRepository creation: None\nRepo forking: None\nPages creation: None\nTeam creation rules: No\n\nBe sure to click save in each area after making changes.\nSettings > Actions > General\nAll repositories: Allow all actions and reusable workflows.\nWorkflow permissions: Read and write permissions.\n\n\nTeams\n\n2 teams, one for the TAs and one for the students\nYou must then manually add the teams to any repos they should access\n\nI generally give the TAs “Write” permission, and the students “Read” permission with some exceptions. See the Repos section below."
},
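The entry above mentions R scripts for getting students added to the Org, but those scripts are not part of this diff. Below is only a minimal, hypothetical sketch of the idea using the {gh} package and GitHub's organization-membership endpoint; the org name and roster file are placeholders, and the PAT behind gh() is assumed to have the admin:org scope.

library(gh)  # assumes a GITHUB_PAT with admin:org scope is configured

org <- "stat-406-2023"            # placeholder Org name
roster <- read.csv("roster.csv")  # placeholder roster with a `github` column

for (user in roster$github) {
  # "Set organization membership for a user": sends an Org invitation
  gh("PUT /orgs/{org}/memberships/{username}",
     org = org, username = user, role = "member")
}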
{
- "objectID": "syllabus.html#course-info",
- "href": "syllabus.html#course-info",
- "title": " Syllabus",
- "section": "Course info",
- "text": "Course info\nInstructor:\nDaniel McDonald\nOffice: Earth Sciences Building 3106\nWebsite: https://dajmcdon.github.io/\nEmail: daniel@stat.ubc.ca\nSlack: @prof-daniel\nOffice hours:\nMonday (TA), 2-3pm ESB 1045\nThursday/Tuesday (DJM), 10-11am ESB 4182 (the first Tuesday of each month will be moved to Thursday)\nThursday (TA), 3-4pm ESB 3174\nFriday (TA/DJM), 10-11am Zoom (link on Canvas)\nCourse webpage:\nWWW: https://ubc-stat.github.io/stat-406/\nGithub: https://github.com/stat-406-2023\nSee also Canvas\nLectures:\nTue/Thu 0800h - 0930h\n(In person) Earth Sciences Building (ESB) 1012\nTextbooks:\n[ISLR]\n[ESL]\nPrerequisite:\nSTAT 306 or CPSC 340"
+ "objectID": "course-setup.html#repos",
+ "href": "course-setup.html#repos",
+ "title": "Guide for setting up the course infrastructure",
+ "section": "Repos",
+ "text": "Repos\nThere are typically about 10 repositories. Homeworks and Labs each have 3 with very similar behaviours.\nBe careful copying directories. All of them have hidden files and folders, e.g. .git. Of particular importance are the .github directories which contain PR templates and GitHub Actions. Also relevant are the .Rprofile files which try to override Student Language settings and avoid unprintible markdown characters.\n\nHomeworks\n\nhomework-solutions\nThis is where most of the work happens. My practice is to create the homework solutions first. I edit these (before school starts) until I’m happy. I then duplicate the file and remove the answers. The result is hwxx-instructions.Rmd. The .gitignore file should ignore all of the solutions and commmit only the instructions. Then, about 1 week after the deadline, I adjust the .gitignore and push the solution files.\n\nStudents have Read permission.\nTAs have Write permission.\nThe preamble.tex file is common to HWs and Labs. It creates a lavender box where the solution will go. This makes life easy for the TAs.\n\n\n\nhomework-solutions-private\nExactly the same as homework-solutions except that all solutions are available from the beginning for TA access. To create this, after I’m satisfied with homework-solutions I copy all files (not the directory) into a new directory, git init then upload to the org. The students never have permission here.\n\n\nhomework-template\nThis is a “template repo” used for creating student specific homework-studentgh repos (using the setup scripts).\nVery Important: copy the hwxx-instructions files over to a new directory. Do NOT copy the directory or you’ll end up with the solutions visible to the students.\nThen rename hwxx-instructions.Rmd to hwxx.Rmd. Now the students have a .pdf with instructions, and a template .Rmd to work on.\nOther important tasks: * The .gitignore is more elaborate in an attempt to avoid students pushing junk into these repos. * The .github directory contains 3 files: CODEOWNERS begins as an empty doc which will be populated with the assigned grader later; pull_request_template.md is used for all HW submission PRs; workflows contains a GH-action to comment on the PR with the date+time when the PR is opened. * Under Settings > General, select “Template repository”. This makes it easier to duplicate to the student repos.\n\n\n\nLabs\nThe three Labs repos operate exactly as the analogous homework repos.\n\nlabs-solutions\nDo any edits here before class begins.\n\n\nlabs-solutions-private\nSame as with the homeworks\n\n\nlabs-template\nSame as with the homeworks\n\n\n\nclicker-solutions\nThis contains the complete set of clicker questions.\nAnswers are hidden in comments on the presentation.\nI release them incrementally after each module (copying over from my clicker deck).\n\n\nopen-pr-log\nThis contains a some GitHub actions to automatically keep track of open PRs for the TAs.\nIt’s still in testing phase, but should work properly. It will create two markdown docs, 1 for labs and 1 for homework. Each shows the assigned TA, the date the PR was opened, and a link to the PR. If everything is configured properly, it should run automatically at 3am every night.\n\nOnly the TAs should have access.\nUnder Settings > Secrets and Variables > Actions you must add a “Repository Secret”. This should be a GitHub Personal Access Token created in your account (Settings > Developer settings > Tokens (classic)). It needs Repo, Workflow, and Admin:Org permissions. 
I set it to expire at the end of the course. I use it only for this purpose (rather than my other tokens for typical logins).\n\n\n\n.github / .github-private\nThese contains a README that gives some basic information about the available repos and the course. It’s visible Publically, and appears on the Org homepage for all to see. The .github-private has the same function, but applies only to Org members.\n\n\nbakeoff-bakeoff\nThis is for the bonus for HW4. Both TAs and Students have access. I put the TA team as CODEOWNERS and protect the main branch (Settings > Branches > Branch Protection Rules). Here, we “Require approvals” and “Require Review from Code Owners”."
},
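The homework-template entry above relies on setup scripts to create the student-specific homework repos, and those scripts are not shown in this diff. Here is only a rough, hypothetical sketch of what that step could look like with the {gh} package and GitHub's template-repo and collaborator endpoints; the org name and username are placeholders, and this is not the course's actual script.

library(gh)

org  <- "stat-406-2023"   # placeholder Org name
user <- "studentgh"       # placeholder GitHub username

# Generate homework-<username> from the template repo
# (requires "Template repository" to be enabled, as described above)
gh("POST /repos/{template_owner}/{template_repo}/generate",
   template_owner = org, template_repo = "homework-template",
   owner = org, name = paste0("homework-", user), private = TRUE)

# Give that student write access to their own repo
gh("PUT /repos/{owner}/{repo}/collaborators/{username}",
   owner = org, repo = paste0("homework-", user),
   username = user, permission = "push")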
{
- "objectID": "syllabus.html#course-objectives",
- "href": "syllabus.html#course-objectives",
- "title": " Syllabus",
- "section": "Course objectives",
- "text": "Course objectives\nThis is a course in statistical learning methods. Based on the theory of linear models covered in Stat 306, this course will focus on applying many techniques of data analysis to interesting datasets.\nThe course combines analysis with methodology and computational aspects. It treats both the “art” of understanding unfamiliar data and the “science” of analyzing that data in terms of statistical properties. The focus will be on practical aspects of methodology and intuition to help students develop tools for selecting appropriate methods and approaches to problems in their own lives.\nThis is not a “how to program” course, nor a “tour of machine learning methods”. Rather, this course is about how to understand some ML methods. STAT 306 tends to give background in many of the tools of understanding as well as working with already-written R packages. On the other hand, CPSC 340 introduces many methods with a focus on “from-scratch” implementation (in Julia or Python). This course will try to bridge the gap between these approaches. Depending on which course you took, you may be more or less skilled in some aspects than in others. That’s OK and expected.\n\nLearning outcomes\n\nAssess the prediction properties of the supervised learning methods covered in class;\nCorrectly use regularization to improve predictions from linear models, and also to identify important explanatory variables;\nExplain the practical difference between predictions obtained with parametric and non-parametric methods, and decide in specific applications which approach should be used;\nSelect and construct appropriate ensembles to obtain improved predictions in different contexts;\nUse and interpret principal components and other dimension reduction techniques;\nEmploy reasonable coding practices and understand basic R syntax and function.\nWrite reports and use proper version control; engage with standard software."
+ "objectID": "course-setup.html#r-package",
+ "href": "course-setup.html#r-package",
+ "title": "Guide for setting up the course infrastructure",
+ "section": "R package",
+ "text": "R package\nThis is hosted at https://github.com/ubc-stat/stat-406-rpackage/. The main purposes are:\n\nDocumentation of datasets used in class, homework, and labs (if not in other R packages)\nProvide a few useful functions.\nInstall all the packages the students need at once, and try to compile LaTeX.\n\nPackage requirements are done manually, unfortunately. Typically, I’ll open the various projects in RStudio and run sort(unique(renv::dependencies()$Package)). It’s not infallible, but works well.\nAll necessary packages should go in “Suggests:” in the DESCRIPTION. This avoids build errors. Note that install via remotes::install_github() then requires dependencies = TRUE."
},
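A sketch of the manual dependency check described above: scan the course projects with renv::dependencies() and compare the result against the Suggests field of the package DESCRIPTION. The project paths below are placeholders.

# Packages used anywhere in the course materials (paths are placeholders)
projects <- c("../stat-406", "../homework-solutions", "../labs-solutions")
used <- sort(unique(unlist(
  lapply(projects, function(p) renv::dependencies(p)$Package)
)))

# Packages currently listed in Suggests:
suggests <- read.dcf("DESCRIPTION", fields = "Suggests")[1, 1]
suggests <- trimws(strsplit(suggests, ",")[[1]])

setdiff(used, suggests)  # used in the materials but missing from DESCRIPTION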
{
- "objectID": "syllabus.html#textbooks",
- "href": "syllabus.html#textbooks",
- "title": " Syllabus",
- "section": "Textbooks",
- "text": "Textbooks\n\nRequired:\nAn Introduction to Statistical Learning, James, Witten, Hastie, Tibshirani, 2013, Springer, New York. (denoted [ISLR])\nAvailable free online: https://www.statlearning.com\n\n\nOptional (but excellent):\nThe Elements of Statistical Learning, Hastie, Tibshirani, Friedman, 2009, Second Edition, Springer, New York. (denoted [ESL])\nAlso available free online: https://web.stanford.edu/~hastie/ElemStatLearn/\nThis second book is a more advanced treatment of a superset of the topics we will cover. If you want to learn more and understand the material more deeply, this is the book for you. All readings from [ESL] are optional."
+ "objectID": "course-setup.html#worksheets",
+ "href": "course-setup.html#worksheets",
+ "title": "Guide for setting up the course infrastructure",
+ "section": "Worksheets",
+ "text": "Worksheets\nThese are derived from Matías’s Rmd notes from 2018. They haven’t been updated much.\nThey are hosted at https://github.com/ubc-stat/stat-406-worksheets/.\nI tried requiring them one year. The model was to distribute the R code for the chapters with some random lines removed. Then the students could submit the completed code for small amounts of credit. It didn’t seem to move the needle much and was hard to grade (autograding would be nice here).\nNote that there is a GHaction that automatically renders the book from source and pushes to the gh-pages branch. So local build isn’t necessary and derivative files should not be checked in to version control."
},
{
- "objectID": "syllabus.html#course-assessment-opportunities",
- "href": "syllabus.html#course-assessment-opportunities",
- "title": " Syllabus",
- "section": "Course assessment opportunities",
- "text": "Course assessment opportunities\n\nEffort-based component\nLabs: [0, 20]\nHomework assignments: [0, 50]\nClickers: [0, 10]\nTotal: min(65, Labs + Homework + Clickers)\n\n\nLabs\nThese are intended to keep you on track. They are to be submitted via pull requests in your personal labs-<username> repo (see the computing tab for descriptions on how to do this).\nLabs typically have a few questions for you to answer or code to implement. These are to be done during lab periods. But you can do them on your own as well. These are worth 2 points each up to a maximum of 20 points. They are due at 2300 on the day of your assigned lab section.\nIf you attend lab, you may share a submission with another student (with acknowledgement on the PR). If you do not attend lab, you must work on your own (subject to the collaboration instructions for Assignments below).\n\nRules.\nYou must submit via PR by the deadline. Your PR must include at least 3 commits. After lab 2, failure to include at least 3 commits will result in a maximum score of 1.\n\n\n\n\n\n\nTip\n\n\n\nIf you attend your lab section, you may work in pairs, submitting a single document to one of your Repos. Be sure to put both names on the document, and mention the collaboration on your PR. You still have until 11pm to submit.\n\n\n\n\nMarking.\nThe overriding theme here is “if you put in the effort, you’ll get all the points.” Grading scheme:\n\n2 if basically all correct\n\n1 if complete but with some major errors, or mostly complete and mostly correct\n\n0 otherwise\n\nYou may submit as many labs as you wish up to 20 total points.\nThere are no appeals on grades.\nIt’s important here to recognize just how important active participation in these activities is. You learn by doing, and this is your opportunity to learn in a low-stakes environment. One thing you’ll learn, for example, is that all animals urinate in 21 seconds.1\n\n\n\nAssignments\nThere will be 5 assignments. These are submitted via pull request similar to the labs but to the homework-<username> repo. Each assignment is worth up to 10 points. They are due by 2300 on the deadline. You must make at least 5 commits. Failure to have at least 5 commits will result in a 25% deduction on HW1 and a 50% deduction thereafter. No exceptions.\nAssignments are typically lightly marked. The median last year was 8/10. But they are not easy. Nor are they short. They often involve a combination of coding, writing, description, and production of statistical graphics.\nAfter receiving a mark and feedback, if you score less than 7, you may make corrections to bring your total to 7. This means, if you fix everything that you did wrong, you get 7. Not 10. The revision must be submitted within 1 week of getting your mark. Only 1 revision per assignment. The TA decision is final. Note that the TAs will only regrade parts you missed, but if you somehow make it worse, they can deduct more points.\nThe revision allowance applies only if you got 3 or more points of “content” deductions. If you missed 3 points for content and 2 more for “penalties” (like insufficient commits, code that runs off the side of the page, etc), then you are ineligible.\n\nPolicy on collaboration on assignments\nDiscussing assignments with your classmates is allowed and encouraged, but it is important that every student get practice working on these problems. This means that all the work you turn in must be your own. 
The general policy on homework collaboration is:\n\nYou must first make a serious effort to solve the problem.\nIf you are stuck after doing so, you may ask for help from another student. You may discuss strategies to solve the problem, but you may not look at their code, nor may they spell out the solution to you step-by-step.\nOnce you have gotten help, you must write your own solution individually. You must disclose, in your GitHub pull request, the names of anyone from whom you got help.\nThis also applies in reverse: if someone approaches you for help, you must not provide it unless they have already attempted to solve the problem, and you may not share your code or spell out the solution step-by-step.\n\n\n\n\n\n\n\nWarning\n\n\n\nAdherence to the above policy means that identical answers, or nearly identical answers, cannot occur. Thus, such occurrences are violations of the Course’s Academic honesty policy.\n\n\nThese rules also apply to getting help from other people such as friends not in the course (try the problem first, discuss strategies, not step-by-step solutions, acknowledge those from whom you received help).\nYou may not use homework help websites, ChatGPT, Stack Overflow, and so on under any circumstances. The purpose here is to learn. Good faith efforts toward learning are rewarded.\nYou can always, of course, ask me for help on Slack. And public Slack questions are allowed and encouraged.\nYou may also use external sources (books, websites, papers, …) to\n\nLook up programming language documentation, find useful packages, find explanations for error messages, or remind yourself about the syntax for some feature. I do this all the time in the real world. Wikipedia is your friend.\nRead about general approaches to solving specific problems (e.g. a guide to dynamic programming or a tutorial on unit testing in your programming language), or\nClarify material from the course notes or assignments.\n\nBut external sources must be used to support your solution, not to obtain your solution. You may not use them to\n\nFind solutions to the specific problems assigned as homework (in words or in code)—you must independently solve the problem assigned, not translate a solution presented online or elsewhere.\nFind course materials or solutions from this or similar courses from previous years, or\nCopy text or code to use in your submissions without attribution.\n\nIf you use code from online or other sources, you must include code comments identifying the source. It must be clear what code you wrote and what code is from other sources. This rule also applies to text, images, and any other material you submit.\nPlease talk to me if you have any questions about this policy. Any form of plagiarism or cheating will result in sanctions to be determined by me, including grade penalties (such as negative points for the assignment or reductions in letter grade) or course failure. I am obliged to report violations to the appropriate University authorities. See also the text below.\n\n\n\nClickers\nThese are short multiple choice and True / False questions. They happen in class. For each question, correct answers are worth 4, incorrect answers are worth 2. You get 0 points for not answering.\nSuppose there are N total clicker questions, and you have x points. 
Your final score for this component is\nmax(0, min(5 * x / N - 5, 10)).\nNote that if your average is less than 1, you get 0 points in this component.\n\n\n\n\n\n\nImportant\n\n\n\nIn addition, your final grade in this course will be reduced by 1 full letter grade.\n\n\nThis means that if you did everything else and get a perfect score on the final exam, you will get a 79. Two people did this last year. They were sad.\n\n\n\n\n\n\nWarning\n\n\n\nDON’T DO THIS!!\n\n\nThis may sound harsh, but think about what is required for such a penalty. You’d have to skip more than 50% of class meetings and get every question wrong when you are in class. This is an in-person course. It is not possible to get an A without attending class on a regular basis.\nTo compensate, I will do my best to post recordings of lectures. Past experience has shown 2 things:\n\nYou learn better by attending class than by skipping and “watching”.\nSometimes the technology messes up. So there’s no guarantee that these will be available.\n\nThe purpose is to let you occasionally miss class for any reason with minimal consequences. See also below. If for some reason you need to miss longer streches of time, please contact me or discuss your situation with your Academic Advisor as soon as possible. Don’t wait until December.\n\n\n\nYour score on HW, Labs, and Clickers\nThe total you can accumulate across these 3 components is 65 points. But you can get there however you want. The total available is 80 points. The rest is up to you. But with choice, comes responsibility.\nRules:\n\nNothing dropped.\nNo extensions.\nIf you miss a lab or a HW deadline, then you miss it.\nMake up for missed work somewhere else.\nIf you isolate due to Covid, fine. You miss a few clickers and maybe a lab (though you can do it remotely).\nIf you have a job interview and can’t complete an assignment on time, then skip it.\n\nWe’re not going to police this stuff. You don’t need to let me know. There is no reason that every single person enrolled in this course shouldn’t get > 65 in this class.\nIllustrative scenarios:\n\nDoing 80% on 5 homeworks, coming to class and getting 50% correct, get 2 points on 8 labs gets you 65 points.\nDoing 90% on 5 homeworks, getting 50% correct on all the clickers, averaging 1/2 on all the labs gets you 65 points.\nGoing to all the labs and getting 100%, 100% on 4 homeworks, plus being wrong on every clicker gets you 65 points\n\nChoose your own adventure. Note that the biggest barrier to getting to 65 is skipping the assignments.\n\n\n\n\nFinal exam\n35 points\n\n\nAll multiple choice, T/F, matching.\nThe clickers are the best preparation.\nQuestions may ask you to understand or find mistakes in code.\nNo writing code.\n\nThe Final is very hard. By definition, it cannot be effort-based.\nIt is intended to separate those who really understand the material from those who don’t. Last year, the median was 50%. But if you put in the work (do all the effort points) and get 50%, you get an 83 (an A-). If you put in the work (do all the effort points) and skip the final, you get a 65. You do not have to pass the final to pass the course. You don’t even have to take the final.\nThe point of this scheme is for those who work hard to do well. But only those who really understand the material will get 90+."
+ "objectID": "course-setup.html#course-website-lectures",
+ "href": "course-setup.html#course-website-lectures",
+ "title": "Guide for setting up the course infrastructure",
+ "section": "Course website / lectures",
+ "text": "Course website / lectures"
},
{
- "objectID": "syllabus.html#health-issues-and-considerations",
- "href": "syllabus.html#health-issues-and-considerations",
- "title": " Syllabus",
- "section": "Health issues and considerations",
- "text": "Health issues and considerations\n\nCovid Safety in the Classroom\n\n\n\n\n\n\nImportant\n\n\n\nIf you think you’re sick, stay home no matter what.\n\n\nMasks. Masks are recommended. For our in-person meetings in this class, it is important that all of us feel as comfortable as possible engaging in class activities while sharing an indoor space. Masks are a primary tool to make it harder for Covid-19 to find a new host. Please feel free to wear one or not given your own personal circumstances. Note that there are some people who cannot wear a mask. These individuals are equally welcome in our class.\nVaccination. If you have not yet had a chance to get vaccinated against Covid-19, vaccines are available to you, free. See http://www.vch.ca/covid-19/covid-19-vaccine for help finding an appointment. Boosters will be available later this term. The higher the rate of vaccination in our community overall, the lower the chance of spreading this virus. You are an important part of the UBC community. Please arrange to get vaccinated if you have not already done so. The same goes for Flu.\n\n\nYour personal health\n\n\n\n\n\n\nWarning\n\n\n\nIf you are sick, it’s important that you stay home – no matter what you think you may be sick with (e.g., cold, flu, other).\n\n\n\nDo not come to class if you have Covid symptoms, have recently tested positive for Covid, or are required to quarantine. You can check this website to find out if you should self-isolate or self-monitor: http://www.bccdc.ca/health-info/diseases-conditions/covid-19/self-isolation#Who.\nYour precautions will help reduce risk and keep everyone safer. In this class, the marking scheme is intended to provide flexibility so that you can prioritize your health and still be able to succeed. All work can be completed outside of class with reasonable time allowances.\nIf you do miss class because of illness:\n\nMake a connection early in the term to another student or a group of students in the class. You can help each other by sharing notes. If you don’t yet know anyone in the class, post on the discussion forum to connect with other students.\nConsult the class resources on here and on Canvas. We will post all the slides, readings, and recordings for each class day.\nUse Slack for help.\nCome to virtual office hours.\nSee the marking scheme for reassurance about what flexibility you have. No part of your final grade will be directly impacted by missing class.\n\nIf you are sick on final exam day, do not attend the exam. You must follow up with your home faculty’s advising office to apply for deferred standing. Students who are granted deferred standing write the final exam at a later date. If you’re a Science student, you must apply for deferred standing (an academic concession) through Science Advising no later than 48 hours after the missed final exam/assignment. Learn more and find the application online. For additional information about academic concessions, see the UBC policy here.\n\n\n\n\n\n\n\nNote\n\n\n\nPlease talk with me if you have any concerns or ask me if you are worried about falling behind."
+ "objectID": "course-setup.html#ghclass-package",
+ "href": "course-setup.html#ghclass-package",
+ "title": "Guide for setting up the course infrastructure",
+ "section": "{ghclass} package",
+ "text": "{ghclass} package"
},
{
- "objectID": "syllabus.html#university-policies",
- "href": "syllabus.html#university-policies",
- "title": " Syllabus",
- "section": "University policies",
- "text": "University policies\nUBC provides resources to support student learning and to maintain healthy lifestyles but recognizes that sometimes crises arise and so there are additional resources to access including those for survivors of sexual violence. UBC values respect for the person and ideas of all members of the academic community. Harassment and discrimination are not tolerated nor is suppression of academic freedom. UBC provides appropriate accommodation for students with disabilities and for religious, spiritual and cultural observances. UBC values academic honesty and students are expected to acknowledge the ideas generated by others and to uphold the highest academic standards in all of their actions. Details of the policies and how to access support are available here.\n\nAcademic honesty and standards\nUBC Vancouver Statement\nAcademic honesty is essential to the continued functioning of the University of British Columbia as an institution of higher learning and research. All UBC students are expected to behave as honest and responsible members of an academic community. Breach of those expectations or failure to follow the appropriate policies, principles, rules, and guidelines of the University with respect to academic honesty may result in disciplinary action.\nFor the full statement, please see the 2022/23 Vancouver Academic Calendar\nCourse specific\nSeveral commercial services have approached students regarding selling class notes/study guides to their classmates. Please be advised that selling a faculty member’s notes/study guides individually or on behalf of one of these services using UBC email or Canvas, violates both UBC information technology and UBC intellectual property policy. Selling the faculty member’s notes/study guides to fellow students in this course is not permitted. Violations of this policy will be considered violations of UBC Academic Honesty and Standards and will be reported to the Dean of Science as a violation of course rules. Sanctions for academic misconduct may include a failing grade on the assignment for which the notes/study guides are being sold, a reduction in your final course grade, a failing grade in the course, among other possibilities. Similarly, contracting with any service that results in an individual other than the enrolled student providing assistance on quizzes or exams or posing as an enrolled student is considered a violation of UBC’s academic honesty standards.\nSome of the problems that are assigned are similar or identical to those assigned in previous years by me or other instructors for this or other courses. Using proofs or code from anywhere other than the textbooks, this year’s course notes, or the course website is not only considered cheating (as described above), it is easily detectable cheating. Such behavior is strictly forbidden.\nIn previous years, I have caught students cheating on the exams or assignments. I did not enforce any penalty because the action did not help. Cheating, in my experience, occurs because students don’t understand the material, so the result is usually a failing grade even before I impose any penalty and report the incident to the Dean’s office. I carefully structure exams and assignments to make it so that I can catch these issues. I will catch you, and it does not help. Do your own work, and use the TAs and me as resources. If you are struggling, we are here to help.\n\n\n\n\n\n\nCaution\n\n\n\nIf I suspect cheating, your case will be forwarded to the Dean’s office. 
No questions asked.\n\n\nGenerative AI\nTools to help you code more quickly are rapidly becoming more prevalent. I use them regularly myself. The point of this course is not to “complete assignments” but to learn coding (and other things). With that goal in mind, I recommend you avoid the use of Generative AI. It is unlikely to contribute directly to your understanding of the material. Furthermore, I have experimented with certain tools on the assignments for this course and have found the results underwhelming.\nThe material in this course is best learned through trial and error. Avoiding this mechanism (with generative AI or by copying your friend) is a short-term solution at best. I have tried to structure this course to discourage these types of short cuts, and minimize the pressure you may feel to take them.\n\n\nAcademic Concessions\nThese are handled according to UBC policy. Please see\n\nUBC student services\nUBC Vancouver Academic Calendar\nFaculty of Science Concessions\n\n\n\nMissed final exam\nStudents who miss the final exam must report to their Faculty advising office within 72 hours of the missed exam, and must supply supporting documentation. Only your Faculty Advising office can grant deferred standing in a course. You must also notify your instructor prior to (if possible) or immediately after the exam. Your instructor will let you know when you are expected to write your deferred exam. Deferred exams will ONLY be provided to students who have applied for and received deferred standing from their Faculty.\n\n\nTake care of yourself\nCourse work at this level can be intense, and I encourage you to take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, avoiding drugs and alcohol, getting enough sleep and taking some time to relax. This will help you achieve your goals and cope with stress. I struggle with these issues too, and I try hard to set aside time for things that make me happy (cooking, playing/listening to music, exercise, going for walks).\nAll of us benefit from support during times of struggle. If you are having any problems or concerns, do not hesitate to speak with me. There are also many resources available on campus that can provide help and support. Asking for support sooner rather than later is almost always a good idea.\nIf you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, I strongly encourage you to seek support. UBC Counseling Services is here to help: call 604 822 3811 or visit their website. Consider also reaching out to a friend, faculty member, or family member you trust to help get you the support you need.\n\nA dated PDF is available at this link."
+ "objectID": "course-setup.html#canvas",
+ "href": "course-setup.html#canvas",
+ "title": "Guide for setting up the course infrastructure",
+ "section": "Canvas",
+ "text": "Canvas\nI use a the shell provided by FoS.\nNothing else goes here, but you have to update all the links.\nTwo Canvas Quizzes: * Quiz 0 collects GitHub accounts, ensures that students read the syllabus. Due in Week 1. * Final Exam is the final * I usually record lectures (automatically) using the classroom tech that automatically uploads. * Update the various links on the Homepage."
},
{
- "objectID": "syllabus.html#footnotes",
- "href": "syllabus.html#footnotes",
- "title": " Syllabus",
- "section": "Footnotes",
- "text": "Footnotes\n\n\nA careful reading of this paper with the provocative title “Law of Urination: all mammals empty their bladders over the same duration” reveals that the authors actually mean something far less precise. In fact, their claim is more accurately stated as “mammals over 3kg in body weight urinate in 21 seconds with a standard deviation of 13 seconds”. But the accurate characterization is far less publicity-worthy.↩︎"
+ "objectID": "course-setup.html#slack",
+ "href": "course-setup.html#slack",
+ "title": "Guide for setting up the course infrastructure",
+ "section": "Slack",
+ "text": "Slack\n\nSet up a free Org. Invite link gets posted to Canvas.\nI add @students.ubc.ca, @ubc.ca, @stat.ubc.ca to the whitelist.\nI also post the invite on Canvas.\nCreate channels before people join. That way you can automatically add everyone to channels all at once. I do one for each module, 1 for code/github, 1 for mechanics. + 1 for the TAs (private)\nClick through all the settings. It’s useful to adjust these a bit."
},
{
- "objectID": "schedule/index.html",
- "href": "schedule/index.html",
- "title": " Schedule",
- "section": "",
- "text": "Required readings and lecture videos are listed below for each module. Readings from [ISLR] are always required while those from [ESL] are optional and supplemental."
+ "objectID": "course-setup.html#clickers",
+ "href": "course-setup.html#clickers",
+ "title": "Guide for setting up the course infrastructure",
+ "section": "Clickers",
+ "text": "Clickers\nSee https://lthub.ubc.ca/guides/iclicker-cloud-instructor-guide/\nI only use “Polling” no “Quizzing” and no “Attendance”\n\nIn clicker Settings > Polling > Sharing. Turn off the Sending (to avoid students doing it at home)\nNo participation points.\n2 points for correct, 2 for answering.\nIntegrations > Set this up with Canvas. Sync the roster. You’ll likely have to repeat this near the Add/Drop Deadline.\nI only sync the total, since I’ll recalibrate later."
},
{
- "objectID": "schedule/index.html#introduction-and-review",
- "href": "schedule/index.html#introduction-and-review",
- "title": " Schedule",
- "section": "0 Introduction and Review",
- "text": "0 Introduction and Review\nRequired reading below is meant to reengage brain cells which have no doubt forgotten all the material that was covered in STAT 306 or CPSC 340. We don’t presume that you remember all these details, but that, upon rereading, they at least sound familiar. If this all strikes you as completely foreign, this class may not be for you.\n\nRequired reading\n\n[ISLR] 2.1, 2.2, and Chapter 3 (this material is review)\n\nOptional reading\n\n[ESL] 2.4 and 2.6\n\nHandouts\n\nProgramming in R .Rmd, .pdf\n\n\nUsing in RMarkdown .Rmd, .pdf\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n05 Sep 23\n(no class, Imagine UBC)\n\n\n\n07 Sep 23\nIntro to class, Git\n(Quiz 0 due tomorrow)\n\n\n12 Sep 23\nUnderstanding R / Rmd\nLab 00, (Labs begin)\n\n\n14 Sep 23\nLM review, LM Example"
+ "objectID": "schedule/slides/00-classification-losses.html#meta-lecture",
+ "href": "schedule/slides/00-classification-losses.html#meta-lecture",
+ "title": "UBC Stat406 2023W",
+ "section": "00 Evaluating classifiers",
+ "text": "00 Evaluating classifiers\nStat 406\nDaniel J. McDonald\nLast modified – 16 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/index.html#model-accuracy",
- "href": "schedule/index.html#model-accuracy",
- "title": " Schedule",
- "section": "1 Model Accuracy",
- "text": "1 Model Accuracy\n\nTopics\n\nModel selection; cross validation; information criteria; stepwise regression\n\nRequired reading\n\n[ISLR] Ch 2.2 (not 2.2.3), 5.1 (not 5.1.5), 6.1, 6.4\n\nOptional reading\n\n[ESL] 7.1-7.5, 7.10\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n19 Sep 23\nRegression function, Bias and Variance\n\n\n\n21 Sep 23\nRisk estimation, Info Criteria\n\n\n\n26 Sep 23\nGreedy selection\n\n\n\n28 Sep 23\n\nHW 1 due"
+ "objectID": "schedule/slides/00-classification-losses.html#how-do-we-measure-accuracy",
+ "href": "schedule/slides/00-classification-losses.html#how-do-we-measure-accuracy",
+ "title": "UBC Stat406 2023W",
+ "section": "How do we measure accuracy?",
+ "text": "How do we measure accuracy?\nSo far — 0-1 loss. If correct class, lose 0 else lose 1.\nAsymmetric classification loss — If correct class, lose 0 else lose something.\nFor example, consider facial recognition. Goal is “person OK”, “person has expired passport”, “person is a known terrorist”\n\nIf classify OK, but was terrorist, lose 1,000,000\nIf classify OK, but expired passport, lose 2\nIf classify terrorist, but was OK, lose 100\nIf classify terrorist, but was expired passport, lose 10\netc.\n\n\nResults in a 3x3 matrix of losses with 0 on the diagonal.\n\n\n [,1] [,2] [,3]\n[1,] 0 2 30\n[2,] 10 0 100\n[3,] 1000000 50000 0"
},
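To make the loss-matrix idea concrete, here is the 3x3 matrix printed above rebuilt in R and used to score some made-up predictions. The slide does not label which dimension is the truth and which is the prediction, so the orientation below (rows = true class, columns = predicted class) is an assumption, and the classes/predictions are invented for illustration.

# The 3x3 asymmetric loss matrix from the slide (rows = true class, cols = predicted class; assumed)
L <- matrix(c(
        0,     2,  30,
       10,     0, 100,
  1000000, 50000,   0
), nrow = 3, byrow = TRUE)

truth <- c(1, 2, 1, 3, 2)     # made-up true classes
preds <- c(1, 1, 2, 1, 2)     # made-up predicted classes
mean(L[cbind(truth, preds)])  # average loss, instead of average 0-1 loss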
{
- "objectID": "schedule/index.html#regularization-smoothing-and-trees",
- "href": "schedule/index.html#regularization-smoothing-and-trees",
- "title": " Schedule",
- "section": "2 Regularization, smoothing, and trees",
- "text": "2 Regularization, smoothing, and trees\n\nTopics\n\nRidge regression, lasso, and related; linear smoothers (splines, kernels); kNN\n\nRequired reading\n\n[ISLR] Ch 6.2, 7.1-7.7.1, 8.1, 8.1.1, 8.1.3, 8.1.4\n\nOptional reading\n\n[ESL] 3.4, 3.8, 5.4, 6.3\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n3 Oct 23\nRidge, Lasso\n\n\n\n5 Oct 23\nCV for comparison, NP 1\n\n\n\n10 Oct 23\nNP 2, Why smoothing?\n\n\n\n12 Oct 23\nNo class (Makeup Monday)\n\n\n\n17 Oct 23\nOther"
+ "objectID": "schedule/slides/00-classification-losses.html#deviance-loss",
+ "href": "schedule/slides/00-classification-losses.html#deviance-loss",
+ "title": "UBC Stat406 2023W",
+ "section": "Deviance loss",
+ "text": "Deviance loss\nSometimes we output probabilities as well as class labels.\nFor example, logistic regression returns the probability that an observation is in class 1. \\(P(Y_i = 1 \\given x_i) = 1 / (1 + \\exp\\{-x'_i \\hat\\beta\\})\\)\nLDA and QDA produce probabilities as well. So do Neural Networks (typically)\n(Trees “don’t”, neither does KNN, though you could fake it)\n\n\n\nDeviance loss for 2-class classification is \\(-2\\textrm{loglikelihood}(y, \\hat{p}) = -2 (y_i x'_i\\hat{\\beta} - \\log (1-\\hat{p}))\\)\n\n(Technically, it’s the difference between this and the loss of the null model, but people play fast and loose)\n\nCould also use cross entropy or Gini index."
},
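A quick numerical check of the deviance expression above, on simulated data (not anything from the course): for 0/1 responses the saturated log-likelihood is 0, so the residual deviance reported by glm() equals -2 times the log-likelihood, which in turn equals the by-hand sum of y_i * (x_i'beta-hat) + log(1 - phat_i), multiplied by -2.

# Simulated data, for illustration only
set.seed(406)
n <- 250
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-x)))
fit <- glm(y ~ x, family = binomial)

phat <- predict(fit, type = "response")  # fitted probabilities
eta  <- predict(fit)                     # linear predictor x'beta-hat

c(
  by_hand    = -2 * sum(y * eta + log(1 - phat)),
  deviance   = deviance(fit),
  neg2loglik = -2 * as.numeric(logLik(fit))
)  # all three agree for 0/1 data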
{
- "objectID": "schedule/index.html#classification",
- "href": "schedule/index.html#classification",
- "title": " Schedule",
- "section": "3 Classification",
- "text": "3 Classification\n\nTopics\n\nlogistic regression; LDA/QDA; naive bayes; trees\n\nRequired reading\n\n[ISLR] Ch 2.2.3, 5.1.5, 4-4.5, 8.1.2\n\nOptional reading\n\n[ESL] 4-4.4, 9.2, 13.3\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n19 Oct 23\nClassification, LDA and QDA\n\n\n\n24 Oct 23\nLogistic regression\nHW 2 due\n\n\n26 Oct 23\nGradient descent, Other losses\n\n\n\n31 Oct 23\nNonlinear"
+ "objectID": "schedule/slides/00-classification-losses.html#calibration",
+ "href": "schedule/slides/00-classification-losses.html#calibration",
+ "title": "UBC Stat406 2023W",
+ "section": "Calibration",
+ "text": "Calibration\nSuppose we predict some probabilities for our data, how often do those events happen?\nIn principle, if we predict \\(\\hat{p}(x_i)=0.2\\) for a bunch of events observations \\(i\\), we’d like to see about 20% 1 and 80% 0. (In training set and test set)\nThe same goes for the other probabilities. If we say “20% chance of rain” it should rain 20% of such days.\nOf course, we didn’t predict exactly \\(\\hat{p}(x_i)=0.2\\) ever, so lets look at \\([.15, .25]\\).\n\nn <- 250\ndat <- tibble(\n x = seq(-5, 5, length.out = n),\n p = 1 / (1 + exp(-x)),\n y = rbinom(n, 1, p)\n)\nfit <- glm(y ~ x, family = binomial, data = dat)\ndat$phat <- predict(fit, type = \"response\") # predicted probabilities\ndat |>\n filter(phat > .15, phat < .25) |>\n summarize(target = .2, obs = mean(y))\n\n\n\n# A tibble: 1 × 2\n target obs\n <dbl> <dbl>\n1 0.2 0.222"
},
{
- "objectID": "schedule/index.html#modern-techniques",
- "href": "schedule/index.html#modern-techniques",
- "title": " Schedule",
- "section": "4 Modern techniques",
- "text": "4 Modern techniques\n\nTopics\n\nbagging; boosting; random forests; neural networks\n\nRequired reading\n\n[ISLR] 5.2, 8.2, 10.1, 10.2, 10.6, 10.7\n\nOptional reading\n\n[ESL] 10.1-10.10 (skip 10.7), 11.1, 11.3, 11.4, 11.7\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n2 Nov 23\nThe bootstrap\n\n\n\n7 Nov 23\nBagging and random forests, Boosting\nHW 3 due\n\n\n9 Nov 23\nIntro to neural nets\n\n\n\n14 Nov 23\nNo class. (Midterm break)\n\n\n\n16 Nov 23\nEstimating neural nets\n\n\n\n21 Nov 23\nNeural nets wrapup\nHW 4 due"
+ "objectID": "schedule/slides/00-classification-losses.html#calibration-plot",
+ "href": "schedule/slides/00-classification-losses.html#calibration-plot",
+ "title": "UBC Stat406 2023W",
+ "section": "Calibration plot",
+ "text": "Calibration plot\n\nbinary_calibration_plot <- function(y, phat, nbreaks = 10) {\n dat <- tibble(y = y, phat = phat) |>\n mutate(bins = cut_number(phat, n = nbreaks))\n midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)\n midpts <- midpts[-length(midpts)] + diff(midpts) / 2\n sum_dat <- dat |>\n group_by(bins) |>\n summarise(\n p = mean(y, na.rm = TRUE),\n se = sqrt(p * (1 - p) / n())\n ) |>\n arrange(p)\n sum_dat$x <- midpts\n\n ggplot(sum_dat, aes(x = x)) +\n geom_errorbar(aes(ymin = pmax(p - 1.96 * se, 0), ymax = pmin(p + 1.96 * se, 1))) +\n geom_point(aes(y = p), colour = blue) +\n geom_abline(slope = 1, intercept = 0, colour = orange) +\n ylab(\"observed frequency\") +\n xlab(\"average predicted probability\") +\n coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +\n geom_rug(data = dat, aes(x = phat), sides = \"b\")\n}"
},
{
- "objectID": "schedule/index.html#unsupervised-learning",
- "href": "schedule/index.html#unsupervised-learning",
- "title": " Schedule",
- "section": "5 Unsupervised learning",
- "text": "5 Unsupervised learning\n\nTopics\n\ndimension reduction and clustering\n\nRequired reading\n\n[ISLR] 12\n\nOptional reading\n\n[ESL] 8.5, 13.2, 14.3, 14.5.1, 14.8, 14.9\n\n\n\n\n\nDate\nSlides\nDeadlines\n\n\n\n\n23 Nov 23\nIntro to PCA, Issues with PCA\n\n\n\n28 Nov 23\nPCA v KPCA\n\n\n\n30 Nov 23\nK means clustering\n\n\n\n5 Dec 23\nHierarchical clustering\n\n\n\n7 Dec 23\n\nHW 5 due"
+ "objectID": "schedule/slides/00-classification-losses.html#amazingly-well-calibrated",
+ "href": "schedule/slides/00-classification-losses.html#amazingly-well-calibrated",
+ "title": "UBC Stat406 2023W",
+ "section": "Amazingly well-calibrated",
+ "text": "Amazingly well-calibrated\n\nbinary_calibration_plot(dat$y, dat$phat, 20L)"
},
{
- "objectID": "schedule/index.html#f-final-exam",
- "href": "schedule/index.html#f-final-exam",
- "title": " Schedule",
- "section": "F Final exam",
- "text": "F Final exam\nMonday, December 18 at 12-2pm, location TBA\n\n\nIn person attendance is required (per Faculty of Science guidelines)\nYou must bring your computer as the exam will be given through Canvas\nPlease arrange to borrow one from the library if you do not have your own. Let me know ASAP if this may pose a problem.\nYou may bring 2 sheets of front/back 8.5x11 paper with any notes you want to use. No other materials will be allowed.\nThere will be no required coding, but I may show code or output and ask questions about it.\nIt will be entirely multiple choice / True-False / matching, etc. Delivered on Canvas."
+ "objectID": "schedule/slides/00-classification-losses.html#less-well-calibrated",
+ "href": "schedule/slides/00-classification-losses.html#less-well-calibrated",
+ "title": "UBC Stat406 2023W",
+ "section": "Less well-calibrated",
+ "text": "Less well-calibrated"
},
{
- "objectID": "schedule/slides/00-cv-for-many-models.html#meta-lecture",
- "href": "schedule/slides/00-cv-for-many-models.html#meta-lecture",
+ "objectID": "schedule/slides/00-classification-losses.html#true-positive-false-negative-sensitivity-specificity",
+ "href": "schedule/slides/00-classification-losses.html#true-positive-false-negative-sensitivity-specificity",
"title": "UBC Stat406 2023W",
- "section": "00 CV for many models",
- "text": "00 CV for many models\nStat 406\nDaniel J. McDonald\nLast modified – 19 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "True positive, false negative, sensitivity, specificity",
+ "text": "True positive, false negative, sensitivity, specificity\n\nTrue positive rate\n\n# correct predict positive / # actual positive (1 - FNR)\n\nFalse negative rate\n\n# incorrect predict negative / # actual positive (1 - TPR), Type II Error\n\nTrue negative rate\n\n# correct predict negative / # actual negative\n\nFalse positive rate\n\n# incorrect predict positive / # actual negative (1 - TNR), Type I Error\n\nSensitivity\n\nTPR, 1 - Type II error\n\nSpecificity\n\nTNR, 1 - Type I error"
},
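A small self-contained example (simulated data, with the 0.5 threshold discussed on the next slide) that computes the rates defined above from a 2x2 confusion table:

# Simulated data, for illustration only
set.seed(406)
n <- 250
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-x)))
phat <- predict(glm(y ~ x, family = binomial), type = "response")
pred <- as.integer(phat > 0.5)  # threshold at 0.5

TP <- sum(pred == 1 & y == 1); FP <- sum(pred == 1 & y == 0)
TN <- sum(pred == 0 & y == 0); FN <- sum(pred == 0 & y == 1)

c(
  TPR_sensitivity = TP / (TP + FN),  # 1 - FNR
  FPR             = FP / (FP + TN),  # 1 - TNR, Type I error
  TNR_specificity = TN / (TN + FP),
  FNR             = FN / (FN + TP)   # Type II error
)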
{
- "objectID": "schedule/slides/00-cv-for-many-models.html#some-data-and-4-models",
- "href": "schedule/slides/00-cv-for-many-models.html#some-data-and-4-models",
+ "objectID": "schedule/slides/00-classification-losses.html#roc-and-thresholds",
+ "href": "schedule/slides/00-classification-losses.html#roc-and-thresholds",
"title": "UBC Stat406 2023W",
- "section": "Some data and 4 models",
- "text": "Some data and 4 models\n\ndata(\"mobility\", package = \"Stat406\")\n\nModel 1: Lasso on all predictors, use CV min\nModel 2: Ridge on all predictors, use CV min\nModel 3: OLS on all predictors (no tuning parameters)\nModel 4: (1) Lasso on all predictors, then (2) OLS on those chosen at CV min\n\nHow do I decide between these 4 models?"
+ "section": "ROC and thresholds",
+ "text": "ROC and thresholds\n\nROC (Receiver Operating Characteristic) Curve\n\nTPR (sensitivity) vs. FPR (1 - specificity)\n\nAUC (Area under the curve)\n\nIntegral of ROC. Closer to 1 is better.\n\n\nSo far, we’ve been thresholding at 0.5, though you shouldn’t always do that.\nWith unbalanced data (say 10% 0 and 90% 1), if you care equally about predicting both classes, you might want to choose a different cutoff (like in LDA).\nTo make the ROC we look at our errors as we vary the cutoff"
},
{
- "objectID": "schedule/slides/00-cv-for-many-models.html#cv-functions",
- "href": "schedule/slides/00-cv-for-many-models.html#cv-functions",
+ "objectID": "schedule/slides/00-classification-losses.html#roc-curve",
+ "href": "schedule/slides/00-classification-losses.html#roc-curve",
"title": "UBC Stat406 2023W",
- "section": "CV functions",
- "text": "CV functions\n\nkfold_cv <- function(data, estimator, predictor, error_fun, kfolds = 5) {\n fold_labels <- sample(rep(seq_len(kfolds), length.out = nrow(data)))\n errors <- double(kfolds)\n for (fold in seq_len(kfolds)) {\n test_rows <- fold_labels == fold\n train <- data[!test_rows, ]\n test <- data[test_rows, ]\n current_model <- estimator(train)\n test$.preds <- predictor(current_model, test)\n errors[fold] <- error_fun(test)\n }\n mean(errors)\n}\n\nloo_cv <- function(dat) {\n mdl <- lm(Mobility ~ ., data = dat)\n mean( abs(residuals(mdl)) / abs(1 - hatvalues(mdl)) ) # MAE version\n}"
+ "section": "ROC curve",
+ "text": "ROC curve\n\n\nroc <- function(prediction, y) {\n op <- order(prediction, decreasing = TRUE)\n preds <- prediction[op]\n y <- y[op]\n noty <- 1 - y\n if (any(duplicated(preds))) {\n y <- rev(tapply(y, preds, sum))\n noty <- rev(tapply(noty, preds, sum))\n }\n tibble(\n FPR = cumsum(noty) / sum(noty),\n TPR = cumsum(y) / sum(y)\n )\n}\n\nggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +\n geom_step(colour = blue, size = 2) +\n geom_abline(slope = 1, intercept = 0)"
},
{
- "objectID": "schedule/slides/00-cv-for-many-models.html#experiment-setup",
- "href": "schedule/slides/00-cv-for-many-models.html#experiment-setup",
+ "objectID": "schedule/slides/00-classification-losses.html#other-stuff",
+ "href": "schedule/slides/00-classification-losses.html#other-stuff",
"title": "UBC Stat406 2023W",
- "section": "Experiment setup",
- "text": "Experiment setup\n\n# prepare our data\n# note that mob has only continuous predictors, otherwise could be trouble\nmob <- mobility[complete.cases(mobility), ] |> select(-ID, -State, -Name)\n# avoid doing this same operation a bunch\nxmat <- function(dat) dat |> select(!Mobility) |> as.matrix()\n\n# set up our model functions\nlibrary(glmnet)\nmod1 <- function(dat, ...) cv.glmnet(xmat(dat), dat$Mobility, type.measure = \"mae\", ...)\nmod2 <- function(dat, ...) cv.glmnet(xmat(dat), dat$Mobility, alpha = 0, type.measure = \"mae\", ...)\nmod3 <- function(dat, ...) glmnet(xmat(dat), dat$Mobility, lambda = 0, ...) # just does lm()\nmod4 <- function(dat, ...) cv.glmnet(xmat(dat), dat$Mobility, relax = TRUE, gamma = 1, type.measure = \"mae\", ...)\n\n# this will still \"work\" on mod3, because there's only 1 s\npredictor <- function(mod, dat) drop(predict(mod, newx = xmat(dat), s = \"lambda.min\"))\n\n# chose mean absolute error just 'cause\nerror_fun <- function(testdata) mean(abs(testdata$Mobility - testdata$.preds))"
+ "section": "Other stuff",
+ "text": "Other stuff\n\n\nSource: worth exploring Wikipedia\n\n\n\nUBC Stat 406 - 2023"
},
{
- "objectID": "schedule/slides/00-cv-for-many-models.html#run-the-experiment",
- "href": "schedule/slides/00-cv-for-many-models.html#run-the-experiment",
+ "objectID": "schedule/slides/00-gradient-descent.html#meta-lecture",
+ "href": "schedule/slides/00-gradient-descent.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Run the experiment",
- "text": "Run the experiment\n\nall_model_funs <- lst(mod1, mod2, mod3, mod4)\nall_fits <- map(all_model_funs, .f = exec, dat = mob)\n\n# unfortunately, does different splits for each method, so we use 10, \n# it would be better to use the _SAME_ splits\nten_fold_cv <- map_dbl(all_model_funs, ~ kfold_cv(mob, .x, predictor, error_fun, 10)) \n\nin_sample_cv <- c(\n mod1 = min(all_fits[[1]]$cvm),\n mod2 = min(all_fits[[2]]$cvm),\n mod3 = loo_cv(mob),\n mod4 = min(all_fits[[4]]$cvm)\n)\n\ntib <- bind_rows(in_sample_cv, ten_fold_cv)\ntib$method = c(\"in_sample\", \"out_of_sample\")\ntib\n\n# A tibble: 2 × 5\n mod1 mod2 mod3 mod4 method \n <dbl> <dbl> <dbl> <dbl> <chr> \n1 0.0159 0.0161 0.0164 0.0156 in_sample \n2 0.0158 0.0161 0.0165 0.0161 out_of_sample\n\n\n\n\nUBC Stat 406 - 2023"
+ "section": "00 Gradient descent",
+ "text": "00 Gradient descent\nStat 406\nDaniel J. McDonald\nLast modified – 25 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#meta-lecture",
- "href": "schedule/slides/00-intro-to-class.html#meta-lecture",
+ "objectID": "schedule/slides/00-gradient-descent.html#simple-optimization-techniques",
+ "href": "schedule/slides/00-gradient-descent.html#simple-optimization-techniques",
"title": "UBC Stat406 2023W",
- "section": "00 Intro to class",
- "text": "00 Intro to class\nStat 406\nDaniel J. McDonald\nLast modified – 17 August 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\]"
+ "section": "Simple optimization techniques",
+ "text": "Simple optimization techniques\nWe’ll see “gradient descent” a few times:\n\nsolves logistic regression (simple version of IRWLS)\ngradient boosting\nNeural networks\n\nThis seems like a good time to explain it.\nSo what is it and how does it work?"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#about-me",
- "href": "schedule/slides/00-intro-to-class.html#about-me",
+ "objectID": "schedule/slides/00-gradient-descent.html#very-basic-example",
+ "href": "schedule/slides/00-gradient-descent.html#very-basic-example",
"title": "UBC Stat406 2023W",
- "section": "About me",
- "text": "About me\n\n\n\nDaniel J. McDonald\ndaniel@stat.ubc.ca\nhttp://dajmcdon.github.io/\nAssociate Professor, Department of Statistics"
+ "section": "Very basic example",
+ "text": "Very basic example\n\n\nSuppose I want to minimize \\(f(x)=(x-6)^2\\) numerically.\nI start at a point (say \\(x_1=23\\))\nI want to “go” in the negative direction of the gradient.\nThe gradient (at \\(x_1=23\\)) is \\(f'(23)=2(23-6)=34\\).\nMove current value toward current value - 34.\n\\(x_2 = x_1 - \\gamma 34\\), for \\(\\gamma\\) small.\nIn general, \\(x_{n+1} = x_n -\\gamma f'(x_n)\\).\n\nniter <- 10\ngam <- 0.1\nx <- double(niter)\nx[1] <- 23\ngrad <- function(x) 2 * (x - 6)\nfor (i in 2:niter) x[i] <- x[i - 1] - gam * grad(x[i - 1])"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#philosophy-of-the-class",
- "href": "schedule/slides/00-intro-to-class.html#philosophy-of-the-class",
+ "objectID": "schedule/slides/00-gradient-descent.html#why-does-this-work",
+ "href": "schedule/slides/00-gradient-descent.html#why-does-this-work",
"title": "UBC Stat406 2023W",
- "section": "Philosophy of the class",
- "text": "Philosophy of the class\nI and the TAs are here to help you learn. Ask questions.\nWe encourage engagement, curiosity and generosity\nWe favour steady work through the Term (vs. sleeping until finals)\n\nThe assessments attempt to reflect this ethos."
+ "section": "Why does this work?",
+ "text": "Why does this work?\nHeuristic interpretation:\n\nGradient tells me the slope.\nnegative gradient points toward the minimum\ngo that way, but not too far (or we’ll miss it)"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#more-philosophy",
- "href": "schedule/slides/00-intro-to-class.html#more-philosophy",
+ "objectID": "schedule/slides/00-gradient-descent.html#why-does-this-work-1",
+ "href": "schedule/slides/00-gradient-descent.html#why-does-this-work-1",
"title": "UBC Stat406 2023W",
- "section": "More philosophy",
- "text": "More philosophy\nWhen the term ends, I want\n\nYou to be better at coding.\nYou to have an understanding of the variety of methods available to do prediction and data analysis.\nYou to articulate their strengths and weaknesses.\nYou to be able to choose between different methods using your intuition and the data.\n\n\nI do not want\n\nYou to be under undo stress\nYou to feel the need to cheat, plagiarize, or drop the course\nYou to feel treated unfairly."
+ "section": "Why does this work?",
+ "text": "Why does this work?\nMore rigorous interpretation:\n\nTaylor expansion \\[\nf(x) \\approx f(x_0) + \\nabla f(x_0)^{\\top}(x-x_0) + \\frac{1}{2}(x-x_0)^\\top H(x_0) (x-x_0)\n\\]\nreplace \\(H\\) with \\(\\gamma^{-1} I\\)\nminimize this quadratic approximation in \\(x\\): \\[\n0\\overset{\\textrm{set}}{=}\\nabla f(x_0) + \\frac{1}{\\gamma}(x-x_0) \\Longrightarrow x = x_0 - \\gamma \\nabla f(x_0)\n\\]"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#section",
- "href": "schedule/slides/00-intro-to-class.html#section",
+ "objectID": "schedule/slides/00-gradient-descent.html#visually",
+ "href": "schedule/slides/00-gradient-descent.html#visually",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "I promise\n\nTo grade/mark fairly. Good faith effort will be rewarded\nTo be flexible. This semester (like the last 4) is different for everyone.\nTo understand and adapt to issues.\n\n\nI do not promise that you will all get the grade you want."
+ "section": "Visually",
+ "text": "Visually"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#on-covid",
- "href": "schedule/slides/00-intro-to-class.html#on-covid",
+ "objectID": "schedule/slides/00-gradient-descent.html#visually-1",
+ "href": "schedule/slides/00-gradient-descent.html#visually-1",
"title": "UBC Stat406 2023W",
- "section": "On COVID",
- "text": "On COVID\n\n\n\nI work on COVID a lot.\nStatistics is hugely important.\n\nPolicies (TL; DR)\n\nI encourage you to wear a mask\nDo NOT come to class if you are possibly sick\nBe kind and considerate to others\nThe Marking scheme is flexible enough to allow some missed classes"
+ "section": "Visually",
+ "text": "Visually"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#section-1",
- "href": "schedule/slides/00-intro-to-class.html#section-1",
+ "objectID": "schedule/slides/00-gradient-descent.html#what-gamma-more-details-than-we-have-time-for",
+ "href": "schedule/slides/00-gradient-descent.html#what-gamma-more-details-than-we-have-time-for",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "We’ll talk about lots of ML models\nBut our focus is on how to “understand” everything in this diagram.\nHow do we interpret? Evaluate? Choose a model?\nWhat are the implications / assumptions implied by our choices?\nDeep understanding of statistics helps with intuition."
+ "section": "What \\(\\gamma\\)? (more details than we have time for)",
+ "text": "What \\(\\gamma\\)? (more details than we have time for)\nWhat to use for \\(\\gamma_k\\)?\nFixed\n\nOnly works if \\(\\gamma\\) is exactly right\nUsually does not work\n\nDecay on a schedule\n\\(\\gamma_{n+1} = \\frac{\\gamma_n}{1+cn}\\) or \\(\\gamma_{n} = \\gamma_0 b^n\\)\nExact line search\n\nTells you exactly how far to go.\nAt each iteration \\(n\\), solve \\(\\gamma_n = \\arg\\min_{s \\geq 0} f( x^{(n)} - s f(x^{(n-1)}))\\)\nUsually can’t solve this."
},
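+ A minimal numerical sketch of exact line search (not part of the original slides), applied to the quadratic f(x1, x2) = x1^2 + 0.5 x2^2 used on the following slides; for this simple case the 1-d minimization can be done numerically, and the helper names f, g, and gam_k are illustrative choices:
+ f <- function(x) x[1]^2 + 0.5 * x[2]^2
+ grad <- function(x) c(2, 1) * x              # gradient of f
+ x <- c(1, 1)                                 # starting point
+ for (k in 1:20) {
+   g <- grad(x)
+   # choose gamma_k by minimizing f along the negative gradient direction
+   gam_k <- optimize(function(s) f(x - s * g), interval = c(0, 2))$minimum
+   x <- x - gam_k * g
+ }
+ x  # close to the minimizer c(0, 0)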
{
- "objectID": "schedule/slides/00-intro-to-class.html#predictive-models",
- "href": "schedule/slides/00-intro-to-class.html#predictive-models",
+ "objectID": "schedule/slides/00-gradient-descent.html#section",
+ "href": "schedule/slides/00-gradient-descent.html#section",
"title": "UBC Stat406 2023W",
- "section": "Predictive models",
- "text": "Predictive models\n\n1. Preprocessing\ncentering / scaling / factors-to-dummies / basis expansion / missing values / dimension reduction / discretization / transformations\n2. Model fitting\nWhich box do you use?\n3. Prediction\nRepeat all the preprocessing on new data. But be careful.\n4. Postprocessing, interpretation, and evaluation"
+ "section": "",
+ "text": "\\[ f(x_1,x_2) = x_1^2 + 0.5x_2^2\\]\n\nx <- matrix(0, 40, 2); x[1, ] <- c(1, 1)\ngrad <- function(x) c(2, 1) * x"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#section-5",
- "href": "schedule/slides/00-intro-to-class.html#section-5",
+ "objectID": "schedule/slides/00-gradient-descent.html#section-1",
+ "href": "schedule/slides/00-gradient-descent.html#section-1",
"title": "UBC Stat406 2023W",
"section": "",
- "text": "Source: https://vas3k.com/blog/machine_learning/"
+ "text": "\\[ f(x_1,x_2) = x_1^2 + 0.5x_2^2\\]\n\ngamma <- .1\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#modules",
- "href": "schedule/slides/00-intro-to-class.html#modules",
+ "objectID": "schedule/slides/00-gradient-descent.html#section-2",
+ "href": "schedule/slides/00-gradient-descent.html#section-2",
"title": "UBC Stat406 2023W",
- "section": "6 modules",
- "text": "6 modules\n\n\n\nReview (today and next week)\nModel accuracy and selection\nRegularization, smoothing, trees\nClassifiers\nModern techniques (classification and regression)\nUnsupervised learning\n\n\n\nEach module is approximately 2 weeks long\nEach module is based on a collection of readings and lectures\nEach module (except the review) has a homework assignment"
+ "section": "",
+ "text": "\\[ f(x_1,x_2) = x_1^2 + 0.5x_2^2\\]\n\ngamma <- .9 # bigger gamma\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#assessments",
- "href": "schedule/slides/00-intro-to-class.html#assessments",
+ "objectID": "schedule/slides/00-gradient-descent.html#section-3",
+ "href": "schedule/slides/00-gradient-descent.html#section-3",
"title": "UBC Stat406 2023W",
- "section": "Assessments",
- "text": "Assessments\nEffort-based\nTotal across three components: 65 points, any way you want\n\nLabs, up to 20 points (2 each)\nAssignments, up to 50 points (10 each)\nClickers, up to 10 points\n\n\nKnowledge-based\nFinal Exam, 35 points"
+ "section": "",
+ "text": "\\[ f(x_1,x_2) = x_1^2 + 0.5x_2^2\\]\n\ngamma <- .9 # big, but decrease it on schedule\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * .9^k * grad(x[k - 1, ])"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#why-this-scheme",
- "href": "schedule/slides/00-intro-to-class.html#why-this-scheme",
+ "objectID": "schedule/slides/00-gradient-descent.html#section-4",
+ "href": "schedule/slides/00-gradient-descent.html#section-4",
"title": "UBC Stat406 2023W",
- "section": "Why this scheme?",
- "text": "Why this scheme?\n\nYou stay on top of the material\nYou come to class and participate\nYou gain coding practice in the labs\nYou work hard on the assignments\n\n\n\n\n\n\n\nMost of this is Effort Based\n\n\nwork hard, guarantee yourself 65%"
+ "section": "",
+ "text": "\\[ f(x_1,x_2) = x_1^2 + 0.5x_2^2\\]\n\ngamma <- .5 # theoretically optimal\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#time-expectations-per-week",
- "href": "schedule/slides/00-intro-to-class.html#time-expectations-per-week",
+ "objectID": "schedule/slides/00-gradient-descent.html#when-do-we-stop",
+ "href": "schedule/slides/00-gradient-descent.html#when-do-we-stop",
"title": "UBC Stat406 2023W",
- "section": "Time expectations per week:",
- "text": "Time expectations per week:\n\nComing to class – 3 hours\nReading the book – 1 hour\nLabs – 1 hour\nHomework – 4 hours\nStudy / thinking / playing – 1 hour\n\n\nShow the course website https://ubc-stat.github.io/stat-406/"
+ "section": "When do we stop?",
+ "text": "When do we stop?\nFor \\(\\epsilon>0\\), small\nCheck any / all of\n\n\\(|f'(x)| < \\epsilon\\)\n\\(|x^{(k)} - x^{(k-1)}| < \\epsilon\\)\n\\(|f(x^{(k)}) - f(x^{(k-1)})| < \\epsilon\\)"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#labs-assignments",
- "href": "schedule/slides/00-intro-to-class.html#labs-assignments",
+ "objectID": "schedule/slides/00-gradient-descent.html#stochastic-gradient-descent",
+ "href": "schedule/slides/00-gradient-descent.html#stochastic-gradient-descent",
"title": "UBC Stat406 2023W",
- "section": "Labs / Assignments",
- "text": "Labs / Assignments\nThe goal is to “Do the work”\n\n\nAssignments\n\nNot easy, especially the first 2, especially if you are unfamiliar with R / Rmarkdown / ggplot\nYou may revise to raise your score to 7/10, see Syllabus. Only if you get lose 3+ for content (penalties can’t be redeemed).\nDon’t leave these for the last minute\n\n\nLabs\n\nLabs should give you practice, allow for questions with the TAs.\nThey are due at 2300 on the day of your lab, lightly graded.\nYou may do them at home, but you must submit individually (in lab, you may share submission)\nLabs are lightly graded"
+ "section": "Stochastic gradient descent",
+ "text": "Stochastic gradient descent\nSuppose \\(f(x) = \\frac{1}{n}\\sum_{i=1}^n f_i(x)\\)\nLike if \\(f(\\beta) = \\frac{1}{n}\\sum_{i=1}^n (y_i - x^\\top_i\\beta)^2\\).\nThen \\(f'(\\beta) = \\frac{1}{n}\\sum_{i=1}^n f'_i(\\beta) = \\frac{1}{n} \\sum_{i=1}^n -2x_i^\\top(y_i - x^\\top_i\\beta)\\)\nIf \\(n\\) is really big, it may take a long time to compute \\(f'\\)\nSo, just sample some partition our data into mini-batches \\(\\mathcal{M}_j\\)\nAnd approximate (imagine the Law of Large Numbers, use a sample to approximate the population) \\[f'(x) = \\frac{1}{n}\\sum_{i=1}^n f'_i(x) \\approx \\frac{1}{m}\\sum_{i\\in\\mathcal{M}_j}f'_{i}(x)\\]"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#clickers",
- "href": "schedule/slides/00-intro-to-class.html#clickers",
+ "objectID": "schedule/slides/00-gradient-descent.html#sgd",
+ "href": "schedule/slides/00-gradient-descent.html#sgd",
"title": "UBC Stat406 2023W",
- "section": "Clickers",
- "text": "Clickers\n\nQuestions are similar to the Final\n0 points for skipping, 2 points for trying, 4 points for correct\n\nAverage of 3 = 10 points (the max)\nAverage of 2 = 5 points\nAverage of 1 = 0 points\ntotal = max(0, min(5 * points / N - 5, 10))\n\nBe sure to sync your device in Canvas.\n\n\n\n\n\n\n\nDon’t do this!\n\n\nAverage < 1 drops your Final Mark 1 letter grade.\nA- becomes B-, C+ becomes D."
+ "section": "SGD",
+ "text": "SGD\n\\[\n\\begin{aligned}\nf'(\\beta) &= \\frac{1}{n}\\sum_{i=1}^n f'_i(\\beta) = \\frac{1}{n} \\sum_{i=1}^n -2x_i^\\top(y_i - x^\\top_i\\beta)\\\\\nf'(x) &= \\frac{1}{n}\\sum_{i=1}^n f'_i(x) \\approx \\frac{1}{m}\\sum_{i\\in\\mathcal{M}_j}f'_{i}(x)\n\\end{aligned}\n\\]\nUsually cycle through “mini-batches”:\n\nUse a different mini-batch at each iteration of GD\nCycle through until we see all the data\n\nThis is the workhorse for neural network optimization"
},
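+ A minimal sketch of mini-batch SGD for the linear-regression loss above (not part of the original slides); the simulated data, batch size m, step size gam, and number of epochs are made-up illustrative choices:
+ set.seed(406)
+ n <- 1000; p <- 5; m <- 100                       # m = mini-batch size
+ X <- matrix(rnorm(n * p), n, p)
+ beta_star <- c(2, -1, 0, 1, 3)
+ y <- drop(X %*% beta_star + rnorm(n))
+ beta <- rep(0, p); gam <- 0.1
+ batches <- split(sample(n), rep(seq_len(n / m), each = m))  # random mini-batches
+ for (epoch in 1:20) {
+   for (idx in batches) {                          # one pass through all batches = one epoch
+     grad_b <- -2 * crossprod(X[idx, ], y[idx] - X[idx, ] %*% beta) / m
+     beta <- beta - gam * drop(grad_b)
+   }
+ }
+ round(beta, 2)  # should be close to beta_star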
{
- "objectID": "schedule/slides/00-intro-to-class.html#final-exam",
- "href": "schedule/slides/00-intro-to-class.html#final-exam",
+ "objectID": "schedule/slides/00-gradient-descent.html#gradient-descent-for-logistic-regression",
+ "href": "schedule/slides/00-gradient-descent.html#gradient-descent-for-logistic-regression",
"title": "UBC Stat406 2023W",
- "section": "Final Exam",
- "text": "Final Exam\n\nScheduled by the university.\nIt is hard\nThe median last year was 50% \\(\\Rightarrow\\) A-\n\nPhilosophy:\n\nIf you put in the effort, you’re guaranteed a C+.\nBut to get an A+, you should really deeply understand the material.\n\nNo penalty for skipping the final.\nIf you’re cool with C+ and hate tests, then that’s fine."
+ "section": "Gradient descent for Logistic regression",
+ "text": "Gradient descent for Logistic regression\nSuppose \\(Y=1\\) with probability \\(p(x)\\) and \\(Y=0\\) with probability \\(1-p(x)\\), \\(x \\in \\R\\).\nI want to model \\(P(Y=1| X=x)\\).\nI’ll assume that \\(\\log\\left(\\frac{p(x)}{1-p(x)}\\right) = ax\\) for some scalar \\(a\\). This means that \\(p(x) = \\frac{\\exp(ax)}{1+\\exp(ax)} = \\frac{1}{1+\\exp(-ax)}\\)\n\n\n\nn <- 100\na <- 2\nx <- runif(n, -5, 5)\nlogit <- function(x) 1 / (1 + exp(-x))\np <- logit(a * x)\ny <- rbinom(n, 1, p)\ndf <- tibble(x, y)\nggplot(df, aes(x, y)) +\n geom_point(colour = \"cornflowerblue\") +\n stat_function(fun = ~ logit(a * .x))"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#advice",
- "href": "schedule/slides/00-intro-to-class.html#advice",
+ "objectID": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood",
+ "href": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood",
"title": "UBC Stat406 2023W",
- "section": "Advice",
- "text": "Advice\n\nSkipping HW makes it difficult to get to 65\nCome to class!\nYes it’s at 8am. I hate it too.\nTo compensate, I will record the class and post to Canvas.\nIn terms of last year’s class, attendance in lecture and active engagement (asking questions, coming to office hours, etc.) is the best predictor of success.\n\n\nQuestions?"
+ "section": "Reminder: the likelihood",
+ "text": "Reminder: the likelihood\n\\[\nL(y | a, x) = \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n\\]\n\\[\n\\begin{aligned}\n\\ell(y | a, x) &= \\log \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\n= \\sum_{i=1}^n y_i\\log p(x_i) + (1-y_i)\\log(1-p(x_i))\\\\\n&= \\sum_{i=1}^n\\log(1-p(x_i)) + y_i\\log\\left(\\frac{p(x_i)}{1-p(x_i)}\\right)\\\\\n&=\\sum_{i=1}^n ax_i y_i + \\log\\left(1-p(x_i)\\right)\\\\\n&=\\sum_{i=1}^n ax_i y_i + \\log\\left(\\frac{1}{1+\\exp(ax_i)}\\right)\n\\end{aligned}\n\\]"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#textbooks",
- "href": "schedule/slides/00-intro-to-class.html#textbooks",
+ "objectID": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-1",
+ "href": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-1",
"title": "UBC Stat406 2023W",
- "section": "Textbooks",
- "text": "Textbooks\n\n\n\n\n\n\nAn Introduction to Statistical Learning\n\n\nJames, Witten, Hastie, Tibshirani, 2013, Springer, New York. (denoted [ISLR])\nAvailable free online: http://statlearning.com/\n\n\n\n\n\n\n\n\n\nThe Elements of Statistical Learning\n\n\nHastie, Tibshirani, Friedman, 2009, Second Edition, Springer, New York. (denoted [ESL])\nAlso available free online: https://web.stanford.edu/~hastie/ElemStatLearn/\n\n\n\n\nIt’s worth your time to read.\nIf you need more practice, read the Worksheets."
+ "section": "Reminder: the likelihood",
+ "text": "Reminder: the likelihood\n\\[\nL(y | a, x) = \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n\\]\nNow, we want the negative of this. Why?\nWe would maximize the likelihood/log-likelihood, so we minimize the negative likelihood/log-likelihood (and scale by \\(1/n\\))\n\\[-\\ell(y | a, x) = \\frac{1}{n}\\sum_{i=1}^n -ax_i y_i - \\log\\left(\\frac{1}{1+\\exp(ax_i)}\\right)\\]"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#computer",
- "href": "schedule/slides/00-intro-to-class.html#computer",
+ "objectID": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-2",
+ "href": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-2",
"title": "UBC Stat406 2023W",
- "section": "Computer",
- "text": "Computer\n\n\n\n\n\nAll coding in R\nSuggest you use RStudio IDE\nSee https://ubc-stat.github.io/stat-406/ for instructions\nIt tells you how to install what you will need, hopefully all at once, for the whole Term.\nWe will use R and we assume some background knowledge.\nLinks to useful supplementary resources are available on the website.\n\n\n\n\n\n\nThis course is not an intro to R / python / MongoDB / SQL."
+ "section": "Reminder: the likelihood",
+ "text": "Reminder: the likelihood\n\\[\n\\frac{1}{n}L(y | a, x) = \\frac{1}{n}\\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n\\]\nThis is, in the notation of our slides \\(f(a)\\).\nWe want to minimize it in \\(a\\) by gradient descent.\nSo we need the derivative with respect to \\(a\\): \\(f'(a)\\).\nNow, conveniently, this simplifies a lot.\n\\[\n\\begin{aligned}\n\\frac{d}{d a} f(a) &= \\frac{1}{n}\\sum_{i=1}^n -x_i y_i - \\left(-\\frac{x_i \\exp(ax_i)}{1+\\exp(ax_i)}\\right)\\\\\n&=\\frac{1}{n}\\sum_{i=1}^n -x_i y_i + p(x_i)x_i = \\frac{1}{n}\\sum_{i=1}^n -x_i(y_i-p(x_i)).\n\\end{aligned}\n\\]"
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#other-resources",
- "href": "schedule/slides/00-intro-to-class.html#other-resources",
+ "objectID": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-3",
+ "href": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-3",
"title": "UBC Stat406 2023W",
- "section": "Other resources",
- "text": "Other resources\n\nCanvas\n\nGrades, links to videos from class\n\nCourse website\n\nAll the material (slides, extra worksheets) https://ubc-stat.github.io/stat-406\n\nSlack\n\nDiscussion board, questions.\n\nGithub\n\nHomework / Lab submission\n\n\n\n\n\nAll lectures will be recorded and posted\nI cannot guarantee that they will all work properly (sometimes I mess it up)"
+ "section": "Reminder: the likelihood",
+ "text": "Reminder: the likelihood\n\\[\n\\frac{1}{n}L(y | a, x) = \\frac{1}{n}\\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n\\]\n(Simple) gradient descent to minimize \\(-\\ell(a)\\) or maximize \\(L(y|a,x)\\) is:\n\nInput \\(a_1,\\ \\gamma>0,\\ j_\\max,\\ \\epsilon>0,\\ \\frac{d}{da} -\\ell(a)\\).\nFor \\(j=1,\\ 2,\\ \\ldots,\\ j_\\max\\), \\[a_j = a_{j-1} - \\gamma \\frac{d}{da} (-\\ell(a_{j-1}))\\]\nStop if \\(\\epsilon > |a_j - a_{j-1}|\\) or \\(|d / da\\ \\ell(a)| < \\epsilon\\)."
},
{
- "objectID": "schedule/slides/00-intro-to-class.html#some-more-words",
- "href": "schedule/slides/00-intro-to-class.html#some-more-words",
+ "objectID": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-4",
+ "href": "schedule/slides/00-gradient-descent.html#reminder-the-likelihood-4",
"title": "UBC Stat406 2023W",
- "section": "Some more words",
- "text": "Some more words\n\nLectures are hard. It’s 8am, everyone’s tired.\nCoding is hard. I hope you’ll get better at it.\nI strongly urge you to get up at the same time everyday. My plan is to go to the gym on MWF. It’s really hard to sleep in until 10 on MWF and make class at 8 on T/Th.\n\n\n\nLet’s be kind and understanding to each other.\nI have to give you a grade, but I want that grade to reflect your learning and effort, not other junk.\n\n\n\nIf you need help, please ask."
+ "section": "Reminder: the likelihood",
+ "text": "Reminder: the likelihood\n\\[\n\\frac{1}{n}L(y | a, x) = \\frac{1}{n}\\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n\\]\n\namle <- function(x, y, a0, gam = 0.5, jmax = 50, eps = 1e-6) {\n a <- double(jmax) # place to hold stuff (always preallocate space)\n a[1] <- a0 # starting value\n for (j in 2:jmax) { # avoid possibly infinite while loops\n px <- logit(a[j - 1] * x)\n grad <- mean(-x * (y - px))\n a[j] <- a[j - 1] - gam * grad\n if (abs(grad) < eps || abs(a[j] - a[j - 1]) < eps) break\n }\n a[1:j]\n}"
},
{
- "objectID": "schedule/slides/00-r-review.html#meta-lecture",
- "href": "schedule/slides/00-r-review.html#meta-lecture",
+ "objectID": "schedule/slides/00-gradient-descent.html#try-it",
+ "href": "schedule/slides/00-gradient-descent.html#try-it",
"title": "UBC Stat406 2023W",
- "section": "00 R, Rmarkdown, code, and {tidyverse}: A whirlwind tour",
- "text": "00 R, Rmarkdown, code, and {tidyverse}: A whirlwind tour\nStat 406\nDaniel J. McDonald\nLast modified – 11 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
+ "section": "Try it:",
+ "text": "Try it:\n\nround(too_big <- amle(x, y, 5, 50), 3)\n\n [1] 5.000 3.360 2.019 1.815 2.059 1.782 2.113 1.746 2.180 1.711 2.250 1.684\n[13] 2.309 1.669 2.344 1.663 2.359 1.661 2.364 1.660 2.365 1.660 2.366 1.660\n[25] 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660\n[37] 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660\n[49] 2.366 1.660\n\nround(too_small <- amle(x, y, 5, 1), 3)\n\n [1] 5.000 4.967 4.934 4.902 4.869 4.837 4.804 4.772 4.739 4.707 4.675 4.643\n[13] 4.611 4.579 4.547 4.515 4.483 4.451 4.420 4.388 4.357 4.326 4.294 4.263\n[25] 4.232 4.201 4.170 4.140 4.109 4.078 4.048 4.018 3.988 3.957 3.927 3.898\n[37] 3.868 3.838 3.809 3.779 3.750 3.721 3.692 3.663 3.635 3.606 3.578 3.550\n[49] 3.522 3.494\n\nround(just_right <- amle(x, y, 5, 10), 3)\n\n [1] 5.000 4.672 4.351 4.038 3.735 3.445 3.171 2.917 2.688 2.488 2.322 2.191\n[13] 2.094 2.027 1.983 1.956 1.940 1.930 1.925 1.922 1.920 1.919 1.918 1.918\n[25] 1.918 1.918 1.918 1.917 1.917 1.917 1.917"
},
{
- "objectID": "schedule/slides/00-r-review.html#tour-of-rstudio",
- "href": "schedule/slides/00-r-review.html#tour-of-rstudio",
+ "objectID": "schedule/slides/00-gradient-descent.html#visual",
+ "href": "schedule/slides/00-gradient-descent.html#visual",
"title": "UBC Stat406 2023W",
- "section": "Tour of Rstudio",
- "text": "Tour of Rstudio\nThings to note\n\nConsole\nTerminal\nScripts, .Rmd, Knit\nFiles, Projects\nGetting help\nEnvironment, Git"
+ "section": "Visual",
+ "text": "Visual\n\n\nnegll <- function(a) {\n -a * mean(x * y) -\n rowMeans(log(1 / (1 + exp(outer(a, x)))))\n}\nblah <- list_rbind(\n map(\n rlang::dots_list(\n too_big, too_small, just_right, .named = TRUE\n ), \n as_tibble),\n names_to = \"gamma\"\n) |> mutate(negll = negll(value))\nggplot(blah, aes(value, negll)) +\n geom_point(aes(colour = gamma)) +\n facet_wrap(~gamma, ncol = 1) +\n stat_function(fun = negll, xlim = c(-2.5, 5)) +\n scale_y_log10() + \n xlab(\"a\") + \n ylab(\"negative log likelihood\") +\n geom_vline(xintercept = tail(just_right, 1)) +\n scale_colour_brewer(palette = \"Set1\") +\n theme(legend.position = \"none\")"
},
{
- "objectID": "schedule/slides/00-r-review.html#simple-stuff",
- "href": "schedule/slides/00-r-review.html#simple-stuff",
+ "objectID": "schedule/slides/00-gradient-descent.html#check-vs.-glm",
+ "href": "schedule/slides/00-gradient-descent.html#check-vs.-glm",
"title": "UBC Stat406 2023W",
- "section": "Simple stuff",
- "text": "Simple stuff\n\n\nVectors:\n\nx <- c(1, 3, 4)\nx[1]\n\n[1] 1\n\nx[-1]\n\n[1] 3 4\n\nrev(x)\n\n[1] 4 3 1\n\nc(x, x)\n\n[1] 1 3 4 1 3 4\n\n\n\n\n\nMatrices:\n\nx <- matrix(1:25, nrow = 5, ncol = 5)\nx[1,]\n\n[1] 1 6 11 16 21\n\nx[,-1]\n\n [,1] [,2] [,3] [,4]\n[1,] 6 11 16 21\n[2,] 7 12 17 22\n[3,] 8 13 18 23\n[4,] 9 14 19 24\n[5,] 10 15 20 25\n\nx[c(1,3), 2:3]\n\n [,1] [,2]\n[1,] 6 11\n[2,] 8 13"
+ "section": "Check vs. glm()",
+ "text": "Check vs. glm()\n\nsummary(glm(y ~ x - 1, family = \"binomial\"))\n\n\nCall:\nglm(formula = y ~ x - 1, family = \"binomial\")\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \nx 1.9174 0.4785 4.008 6.13e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 138.629 on 100 degrees of freedom\nResidual deviance: 32.335 on 99 degrees of freedom\nAIC: 34.335\n\nNumber of Fisher Scoring iterations: 7\n\n\n\n\nUBC Stat 406 - 2023"
},
{
- "objectID": "schedule/slides/00-r-review.html#simple-stuff-1",
- "href": "schedule/slides/00-r-review.html#simple-stuff-1",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#meta-lecture",
+ "href": "schedule/slides/00-quiz-0-wrap.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Simple stuff",
- "text": "Simple stuff\n\n\nLists\n\n(l <- list(\n a = letters[1:2], \n b = 1:4, \n c = list(a = 1)))\n\n$a\n[1] \"a\" \"b\"\n\n$b\n[1] 1 2 3 4\n\n$c\n$c$a\n[1] 1\n\nl$a\n\n[1] \"a\" \"b\"\n\nl$c$a\n\n[1] 1\n\nl[\"b\"] # compare to l[[\"b\"]] == l$b\n\n$b\n[1] 1 2 3 4\n\n\n\n\nData frames\n\n(dat <- data.frame(\n z = 1:5, \n b = 6:10, \n c = letters[1:5]))\n\n z b c\n1 1 6 a\n2 2 7 b\n3 3 8 c\n4 4 9 d\n5 5 10 e\n\nclass(dat)\n\n[1] \"data.frame\"\n\ndat$b\n\n[1] 6 7 8 9 10\n\ndat[1,]\n\n z b c\n1 1 6 a\n\n\n\n\nData frames are sort-of lists and sort-of matrices"
+ "section": "00 Quiz 0 fun",
+ "text": "00 Quiz 0 fun\nStat 406\nDaniel J. McDonald\nLast modified – 13 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
},
{
- "objectID": "schedule/slides/00-r-review.html#tibbles",
- "href": "schedule/slides/00-r-review.html#tibbles",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#why-this-class",
+ "href": "schedule/slides/00-quiz-0-wrap.html#why-this-class",
"title": "UBC Stat406 2023W",
- "section": "Tibbles",
- "text": "Tibbles\nThese are {tidyverse} data frames\n\n(dat2 <- tibble(z = 1:5, b = z + 5, c = letters[z]))\n\n# A tibble: 5 × 3\n z b c \n <int> <dbl> <chr>\n1 1 6 a \n2 2 7 b \n3 3 8 c \n4 4 9 d \n5 5 10 e \n\nclass(dat2)\n\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n\n\nWe’ll return to classes in a moment. A tbl_df is a “subclass” of data.frame.\nAnything that data.frame can do, tbl_df can do (better).\nFor instance, the printing is more informative.\nAlso, you can construct one by referencing previously constructed columns."
+ "section": "Why this class?",
+ "text": "Why this class?\n\nMost say requirements.\nInterest in ML/Stat learning\nExpressions of love/affection for Stats/CS/ML\nEnjoyment of past similar classes"
},
{
- "objectID": "schedule/slides/00-r-review.html#understanding-signatures",
- "href": "schedule/slides/00-r-review.html#understanding-signatures",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#why-this-class-1",
+ "href": "schedule/slides/00-quiz-0-wrap.html#why-this-class-1",
"title": "UBC Stat406 2023W",
- "section": "Understanding signatures",
- "text": "Understanding signatures\n\n\nCode\nsig <- sig::sig\n\n\n\nsig(lm)\n\nfn <- function(formula, data, subset, weights, na.action, method = \"qr\", model\n = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts =\n NULL, offset, ...)\n\nsig(`+`)\n\nfn <- function(e1, e2)\n\nsig(dplyr::filter)\n\nfn <- function(.data, ..., .by = NULL, .preserve = FALSE)\n\nsig(stats::filter)\n\nfn <- function(x, filter, method = c(\"convolution\", \"recursive\"), sides = 2,\n circular = FALSE, init = NULL)\n\nsig(rnorm)\n\nfn <- function(n, mean = 0, sd = 1)"
+ "section": "Why this class?",
+ "text": "Why this class?\nMore idiosyncratic:\n\n\n“Professor received Phd from CMU, must be an awesome researcher.”\n“Learn strategies.”\n(paraphrase) “Course structure with less weight on exam helps with anxiety”\n(paraphrase) “I love coding in R and want more of it”\n“Emmmmmmmmmmmmmmmm, to learn some skills from Machine Learning and finish my minor🙃.”\n“destiny”\n“challenges from ChatGPT”\n“I thought Daniel Mcdonald is a cool prof…”\n“I have heard this is the most useful stat course in UBC.”"
},
{
- "objectID": "schedule/slides/00-r-review.html#these-are-all-the-same",
- "href": "schedule/slides/00-r-review.html#these-are-all-the-same",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#syllabus-q",
+ "href": "schedule/slides/00-quiz-0-wrap.html#syllabus-q",
"title": "UBC Stat406 2023W",
- "section": "These are all the same",
- "text": "These are all the same\n\nset.seed(12345)\nrnorm(3)\n\n[1] 0.5855288 0.7094660 -0.1093033\n\nset.seed(12345)\nrnorm(n = 3, mean = 0)\n\n[1] 0.5855288 0.7094660 -0.1093033\n\nset.seed(12345)\nrnorm(3, 0, 1)\n\n[1] 0.5855288 0.7094660 -0.1093033\n\nset.seed(12345)\nrnorm(sd = 1, n = 3, mean = 0)\n\n[1] 0.5855288 0.7094660 -0.1093033\n\n\n\nFunctions can have default values.\nYou may, but don’t have to, name the arguments\nIf you name them, you can pass them out of order (but you shouldn’t)."
+ "section": "Syllabus Q",
+ "text": "Syllabus Q"
},
{
- "objectID": "schedule/slides/00-r-review.html#write-lots-of-functions.-i-cant-emphasize-this-enough.",
- "href": "schedule/slides/00-r-review.html#write-lots-of-functions.-i-cant-emphasize-this-enough.",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#programming-languages",
+ "href": "schedule/slides/00-quiz-0-wrap.html#programming-languages",
"title": "UBC Stat406 2023W",
- "section": "Write lots of functions. I can’t emphasize this enough.",
- "text": "Write lots of functions. I can’t emphasize this enough.\n\n\n\nf <- function(arg1, arg2, arg3 = 12, ...) {\n stuff <- arg1 * arg3\n stuff2 <- stuff + arg2\n plot(arg1, stuff2, ...)\n return(stuff2)\n}\nx <- rnorm(100)\n\n\n\n\ny1 <- f(x, 3, 15, col = 4, pch = 19)\n\n\n\n\n\n\n\nstr(y1)\n\n num [1:100] -3.8 12.09 -24.27 12.45 -1.14 ..."
+ "section": "Programming languages",
+ "text": "Programming languages"
},
{
- "objectID": "schedule/slides/00-r-review.html#outputs-vs.-side-effects",
- "href": "schedule/slides/00-r-review.html#outputs-vs.-side-effects",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#matrix-inversion",
+ "href": "schedule/slides/00-quiz-0-wrap.html#matrix-inversion",
"title": "UBC Stat406 2023W",
- "section": "Outputs vs. Side effects",
- "text": "Outputs vs. Side effects\n\n\n\nSide effects are things a function does, outputs can be assigned to variables\nA good example is the hist function\nYou have probably only seen the side effect which is to plot the histogram\n\n\nmy_histogram <- hist(rnorm(1000))\n\n\n\n\n\n\n\n\n\n\n\nstr(my_histogram)\n\nList of 6\n $ breaks : num [1:14] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...\n $ counts : int [1:13] 4 21 41 89 142 200 193 170 74 38 ...\n $ density : num [1:13] 0.008 0.042 0.082 0.178 0.284 0.4 0.386 0.34 0.148 0.076 ...\n $ mids : num [1:13] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...\n $ xname : chr \"rnorm(1000)\"\n $ equidist: logi TRUE\n - attr(*, \"class\")= chr \"histogram\"\n\nclass(my_histogram)\n\n[1] \"histogram\""
+ "section": "Matrix inversion",
+ "text": "Matrix inversion\n\nlibrary(MASS)\nX <- matrix(c(5, 3, 1, -1), nrow = 2)\nX\n\n [,1] [,2]\n[1,] 5 1\n[2,] 3 -1\n\nsolve(X)\n\n [,1] [,2]\n[1,] 0.125 0.125\n[2,] 0.375 -0.625\n\nginv(X)\n\n [,1] [,2]\n[1,] 0.125 0.125\n[2,] 0.375 -0.625\n\nX^(-1)\n\n [,1] [,2]\n[1,] 0.2000000 1\n[2,] 0.3333333 -1"
},
{
- "objectID": "schedule/slides/00-r-review.html#when-writing-functions-program-defensively-ensure-behaviour",
- "href": "schedule/slides/00-r-review.html#when-writing-functions-program-defensively-ensure-behaviour",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#linear-models",
+ "href": "schedule/slides/00-quiz-0-wrap.html#linear-models",
"title": "UBC Stat406 2023W",
- "section": "When writing functions, program defensively, ensure behaviour",
- "text": "When writing functions, program defensively, ensure behaviour\n\n\n\nincrementer <- function(x, inc_by = 1) {\n x + 1\n}\n \nincrementer(2)\n\n[1] 3\n\nincrementer(1:4)\n\n[1] 2 3 4 5\n\nincrementer(\"a\")\n\nError in x + 1: non-numeric argument to binary operator\n\n\n\nincrementer <- function(x, inc_by = 1) {\n stopifnot(is.numeric(x))\n return(x + 1)\n}\nincrementer(\"a\")\n\nError in incrementer(\"a\"): is.numeric(x) is not TRUE\n\n\n\n\n\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + 1\n}\nincrementer(\"a\")\n\nError in incrementer(\"a\"): `x` must be numeric\n\nincrementer(2, -3) ## oops!\n\n[1] 3\n\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + inc_by\n}\nincrementer(2, -3)\n\n[1] -1"
+ "section": "Linear models",
+ "text": "Linear models\n\ny <- X %*% c(2, -1) + rnorm(2)\ncoefficients(lm(y ~ X))\n\n(Intercept) X1 X2 \n 4.8953718 0.9380314 NA \n\ncoef(lm(y ~ X))\n\n(Intercept) X1 X2 \n 4.8953718 0.9380314 NA \n\nsolve(t(X) %*% X) %*% t(X) %*% y\n\n [,1]\n[1,] 2.161874\n[2,] -1.223843\n\nsolve(crossprod(X), crossprod(X, y))\n\n [,1]\n[1,] 2.161874\n[2,] -1.223843\n\n\n\nX \\ y # this is Matlab\n\nError: <text>:1:3: unexpected '\\\\'\n1: X \\\n ^"
},
{
- "objectID": "schedule/slides/00-r-review.html#how-to-keep-track",
- "href": "schedule/slides/00-r-review.html#how-to-keep-track",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#pets-and-plans",
+ "href": "schedule/slides/00-quiz-0-wrap.html#pets-and-plans",
"title": "UBC Stat406 2023W",
- "section": "How to keep track",
- "text": "How to keep track\n\n\n\nlibrary(testthat)\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n if (!is.numeric(inc_by)) {\n stop(\"`inc_by` must be numeric\")\n }\n x + inc_by\n}\nexpect_error(incrementer(\"a\"))\nexpect_equal(incrementer(1:3), 2:4)\nexpect_equal(incrementer(2, -3), -1)\nexpect_error(incrementer(1, \"b\"))\nexpect_identical(incrementer(1:3), 2:4)\n\nError: incrementer(1:3) not identical to 2:4.\nObjects equal but not identical\n\n\n\n\n\nis.integer(2:4)\n\n[1] TRUE\n\nis.integer(incrementer(1:3))\n\n[1] FALSE\n\nexpect_identical(incrementer(1:3, 1L), 2:4)\n\n\n\n\n\n\n\n\n\n\nImportant\n\n\nIf you copy something, write a function.\nValidate your arguments.\nTo ensure proper functionality, write tests to check if inputs result in predicted outputs."
+ "section": "Pets and plans",
+ "text": "Pets and plans"
},
{
- "objectID": "schedule/slides/00-r-review.html#classes",
- "href": "schedule/slides/00-r-review.html#classes",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#grade-predictions",
+ "href": "schedule/slides/00-quiz-0-wrap.html#grade-predictions",
"title": "UBC Stat406 2023W",
- "section": "Classes",
- "text": "Classes\n\n\nWe saw some of these earlier:\n\ntib <- tibble(\n x1 = rnorm(100), \n x2 = rnorm(100), \n y = x1 + 2 * x2 + rnorm(100)\n)\nmdl <- lm(y ~ ., data = tib )\nclass(tib)\n\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n\nclass(mdl)\n\n[1] \"lm\"\n\n\nThe class allows for the use of “methods”\n\nprint(mdl)\n\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n -0.1742 1.0454 2.0470 \n\n\n\n\n\nR “knows what to do” when you print() an object of class \"lm\".\nprint() is called a “generic” function.\nYou can create “methods” that get dispatched.\nFor any generic, R looks for a method for the class.\nIf available, it calls that function."
+ "section": "Grade predictions",
+ "text": "Grade predictions\n\n\n4 people say 100%\n24 say 90%\n25 say 85%\n27 say 80%\nLots of clumping\n\n\n1 said 35, and 1 said 50. Woof!"
},
{
- "objectID": "schedule/slides/00-r-review.html#viewing-the-dispatch-chain",
- "href": "schedule/slides/00-r-review.html#viewing-the-dispatch-chain",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#prediction-accuracy-last-year",
+ "href": "schedule/slides/00-quiz-0-wrap.html#prediction-accuracy-last-year",
"title": "UBC Stat406 2023W",
- "section": "Viewing the dispatch chain",
- "text": "Viewing the dispatch chain\n\nsloop::s3_dispatch(print(incrementer))\n\n=> print.function\n * print.default\n\nsloop::s3_dispatch(print(tib))\n\n print.tbl_df\n=> print.tbl\n * print.data.frame\n * print.default\n\nsloop::s3_dispatch(print(mdl))\n\n=> print.lm\n * print.default"
+ "section": "Prediction accuracy (last year)",
+ "text": "Prediction accuracy (last year)"
},
{
- "objectID": "schedule/slides/00-r-review.html#r-geeky-but-important",
- "href": "schedule/slides/00-r-review.html#r-geeky-but-important",
+ "objectID": "schedule/slides/00-quiz-0-wrap.html#prediction-accuracy-last-year-1",
+ "href": "schedule/slides/00-quiz-0-wrap.html#prediction-accuracy-last-year-1",
"title": "UBC Stat406 2023W",
- "section": "R-Geeky But Important",
- "text": "R-Geeky But Important\nThere are lots of generic functions in R\nCommon ones are print(), summary(), and plot().\nAlso, lots of important statistical modelling concepts: residuals() coef()\n(In python, these work the opposite way: obj.residuals. The dot after the object accesses methods defined for that type of object. But the dispatch behaviour is less robust.)\n\nThe convention is that the specialized function is named method.class(), e.g., summary.lm().\nIf no specialized function is defined, R will try to use method.default().\n\nFor this reason, R programmers try to avoid . in names of functions or objects."
+ "section": "Prediction accuracy (last year)",
+ "text": "Prediction accuracy (last year)\n\nsummary(lm(actual ~ predicted - 1, data = acc))\n\n\nCall:\nlm(formula = actual ~ predicted - 1, data = acc)\n\nResiduals:\n Min 1Q Median 3Q Max \n-63.931 -2.931 1.916 6.052 21.217 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \npredicted 0.96590 0.01025 94.23 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 10.2 on 137 degrees of freedom\n (8 observations deleted due to missingness)\nMultiple R-squared: 0.9848, Adjusted R-squared: 0.9847 \nF-statistic: 8880 on 1 and 137 DF, p-value: < 2.2e-16\n\n\n\n\nUBC Stat 406 - 2023"
},
{
- "objectID": "schedule/slides/00-r-review.html#wherefore-methods",
- "href": "schedule/slides/00-r-review.html#wherefore-methods",
+ "objectID": "schedule/slides/00-version-control.html#meta-lecture",
+ "href": "schedule/slides/00-version-control.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Wherefore methods?",
- "text": "Wherefore methods?\n\nThe advantage is that you don’t have to learn a totally new syntax to grab residuals or plot things\nYou just use residuals(mdl) whether mdl has class lm could have been done two centuries ago, or a Batrachian Emphasis Machine which won’t be invented for another five years.\nThe one draw-back is the help pages for the generic methods tend to be pretty vague\nCompare ?summary with ?summary.lm."
+ "section": "00 Git, Github, and Slack",
+ "text": "00 Git, Github, and Slack\nStat 406\nDaniel J. McDonald\nLast modified – 11 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
},
{
- "objectID": "schedule/slides/00-r-review.html#different-environments",
- "href": "schedule/slides/00-r-review.html#different-environments",
+ "objectID": "schedule/slides/00-version-control.html#course-communication",
+ "href": "schedule/slides/00-version-control.html#course-communication",
"title": "UBC Stat406 2023W",
- "section": "Different environments",
- "text": "Different environments\n\nThese are often tricky, but are very common.\nMost programming languages have this concept in one way or another.\nIn R code run in the Console produces objects in the “Global environment”\nYou can see what you create in the “Environment” tab.\nBut there’s lots of other stuff.\nMany packages are automatically loaded at startup, so you have access to the functions and data inside\n\nFor example mean(), lm(), plot(), iris (technically iris is lazy-loaded, meaning it’s not in memory until you call it, but it is available)"
+ "section": "Course communication",
+ "text": "Course communication\nWebsite:\nhttps://ubc-stat.github.io/stat-406/\n\nHosted on Github.\nLinks to slides and all materials\nSyllabus is there. Be sure to read it."
},
{
- "objectID": "schedule/slides/00-r-review.html#section",
- "href": "schedule/slides/00-r-review.html#section",
+ "objectID": "schedule/slides/00-version-control.html#course-communication-1",
+ "href": "schedule/slides/00-version-control.html#course-communication-1",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "Other packages require you to load them with library(pkg) before their functions are available.\nBut, you can call those functions by prefixing the package name ggplot2::ggplot().\nYou can also access functions that the package developer didn’t “export” for use with ::: like dplyr:::as_across_fn_call()\n\n\nThat is all about accessing “objects in package environments”"
+ "section": "Course communication",
+ "text": "Course communication\nSlack:\n\nLink to join on Canvas. This is our discussion board.\nNote that this data is hosted on servers outside of Canada. You may wish to use a pseudonym to protect your privacy.\nAnything super important will be posted to Slack and Canvas.\nBe sure you get Canvas email.\nIf I am sick, I will cancel class or arrange a substitute."
},
{
- "objectID": "schedule/slides/00-r-review.html#other-issues-with-environments",
- "href": "schedule/slides/00-r-review.html#other-issues-with-environments",
+ "objectID": "schedule/slides/00-version-control.html#course-communication-2",
+ "href": "schedule/slides/00-version-control.html#course-communication-2",
"title": "UBC Stat406 2023W",
- "section": "Other issues with environments",
- "text": "Other issues with environments\nAs one might expect, functions create an environment inside the function.\n\nz <- 1\nfun <- function(x) {\n z <- x\n print(z)\n invisible(z)\n}\nfun(14)\n\n[1] 14\n\n\nNon-trivial cases are data-masking environments.\n\ntib <- tibble(x1 = rnorm(100), x2 = rnorm(100), y = x1 + 2 * x2)\nmdl <- lm(y ~ x2, data = tib)\nx2\n\nError in eval(expr, envir, enclos): object 'x2' not found\n\n\n\nlm() looks “inside” the tib to find y and x2\nThe data variables are added to the lm() environment"
+ "section": "Course communication",
+ "text": "Course communication\nGitHub organization\n\nLinked from the website.\nThis is where you complete / submit assignments / projects / in-class-work\nThis is also hosted on Servers outside Canada https://github.com/stat-406-2023/"
},
{
- "objectID": "schedule/slides/00-r-review.html#other-issues-with-environments-1",
- "href": "schedule/slides/00-r-review.html#other-issues-with-environments-1",
+ "objectID": "schedule/slides/00-version-control.html#why-these",
+ "href": "schedule/slides/00-version-control.html#why-these",
"title": "UBC Stat406 2023W",
- "section": "Other issues with environments",
- "text": "Other issues with environments\nWhen Knit, .Rmd files run in their OWN environment.\nThey are run from top to bottom, with code chunks depending on previous\nThis makes them reproducible.\nJupyter notebooks don’t do this. 😱\nObjects in your local environment are not available in the .Rmd\nObjects in the .Rmd are not available locally.\n\n\n\n\n\n\nTip\n\n\nThe most frequent error I see is:\n\nrunning chunks individually, 1-by-1, and it works\nKnitting, and it fails\n\nThe reason is almost always that the chunks refer to objects in the Environment that don’t exist in the .Rmd"
+ "section": "Why these?",
+ "text": "Why these?\n\nYes, some data is hosted on servers in the US.\nBut in the real world, no one uses Canvas / Piazza, so why not learn things they do use?\nMuch easier to communicate, “mark” or comment on your work\nMuch more DS friendly\nNote that MDS uses both of these, the Stat and CS departments use both, many faculty use them, Google / Amazon / Facebook use things like these, etc.\n\n\nSlack help from MDS features and rules"
},
{
- "objectID": "schedule/slides/00-r-review.html#section-1",
- "href": "schedule/slides/00-r-review.html#section-1",
+ "objectID": "schedule/slides/00-version-control.html#why-version-control",
+ "href": "schedule/slides/00-version-control.html#why-version-control",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "This error also happens because:\n\nlibrary() calls were made globally but not in the .Rmd\n\nso the packages aren’t loaded\n\npaths to data or other objects are not relative to the .Rmd in your file system\n\nthey must be\n\nCarefully keeping Labs / Assignments in their current location will help to avoid some of these."
+ "section": "Why version control?",
+ "text": "Why version control?\n\nMuch of this lecture is based on material from Colin Rundel and Karl Broman"
},
{
- "objectID": "schedule/slides/00-r-review.html#how-to-fix-code",
- "href": "schedule/slides/00-r-review.html#how-to-fix-code",
+ "objectID": "schedule/slides/00-version-control.html#why-version-control-1",
+ "href": "schedule/slides/00-version-control.html#why-version-control-1",
"title": "UBC Stat406 2023W",
- "section": "How to fix code",
- "text": "How to fix code\n\nIf you’re using a function in a package, start with ?function to see the help\n\nMake sure you’re calling the function correctly.\nTry running the examples.\npaste the error into Google (if you share the error on Slack, I often do this first)\nGo to the package website if it exists, and browse around\n\nIf your .Rmd won’t Knit\n\nDid you make the mistake on the last slide?\nDid it Knit before? Then the bug is in whatever you added.\nDid you never Knit it? Why not?\nCall rstudioapi::restartSession(), then run the Chunks 1-by-1"
+ "section": "Why version control?",
+ "text": "Why version control?\n\nSimple formal system for tracking all changes to a project\nTime machine for your projects\n\nTrack blame and/or praise\nRemove the fear of breaking things\n\nLearning curve is steep, but when you need it you REALLY need it\n\n\n\n\nWords of wisdom\n\n\nYour closest collaborator is you six months ago, but you don’t reply to emails.\n– Paul Wilson"
},
{
- "objectID": "schedule/slides/00-r-review.html#section-2",
- "href": "schedule/slides/00-r-review.html#section-2",
+ "objectID": "schedule/slides/00-version-control.html#why-git",
+ "href": "schedule/slides/00-version-control.html#why-git",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "Adding browser()\n\nOnly useful with your own functions.\nOpen the script with the function, and add browser() to the code somewhere\nThen call your function.\nThe execution will Stop where you added browser() and you’ll have access to the local environment to play around"
+ "section": "Why Git",
+ "text": "Why Git\n\n\n\nYou could use something like Box or Dropbox\nThese are poor-man’s version control\nGit is much more appropriate\nIt works with large groups\nIt’s very fast\nIt’s much better at fixing mistakes\nTech companies use it (so it’s in your interest to have some experience)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThis will hurt, but what doesn’t kill you, makes you stronger."
},
{
- "objectID": "schedule/slides/00-r-review.html#reproducible-examples",
- "href": "schedule/slides/00-r-review.html#reproducible-examples",
+ "objectID": "schedule/slides/00-version-control.html#overview",
+ "href": "schedule/slides/00-version-control.html#overview",
"title": "UBC Stat406 2023W",
- "section": "Reproducible examples",
- "text": "Reproducible examples\n\n\n\n\n\n\nQuestion I get on Slack that I hate:\n\n\n“I ran the code like you had on Slide 39, but it didn’t work.”\n\n\n\n\nIf you want to ask me why the code doesn’t work, you need to show me what’s wrong.\n\n\n\n\n\n\n\nDon’t just paste a screenshot!\n\n\nUnless you get lucky, I won’t be able to figure it out from that. And we’ll both get frustrated.\n\n\n\nWhat you need is a Reproducible Example or reprex.\n\nThis is a small chunk of code that\n\nruns in it’s own environment\nand produces the error."
+ "section": "Overview",
+ "text": "Overview\n\ngit is a command line program that lives on your machine\nIf you want to track changes in a directory, you type git init\nThis creates a (hidden) directory called .git\nThe .git directory contains a history of all changes made to “versioned” files\nThis top directory is referred to as a “repository” or “repo”\nhttp://github.com is a service that hosts a repo remotely and has other features: issues, project boards, pull requests, renders .ipynb & .md\nSome IDEs (pycharm, RStudio, VScode) have built in git\ngit/GitHub is broad and complicated. Here, just what you need"
},
{
- "objectID": "schedule/slides/00-r-review.html#reproducible-examples-how-it-works",
- "href": "schedule/slides/00-r-review.html#reproducible-examples-how-it-works",
+ "objectID": "schedule/slides/00-version-control.html#aside-on-built-in-command-line",
+ "href": "schedule/slides/00-version-control.html#aside-on-built-in-command-line",
"title": "UBC Stat406 2023W",
- "section": "Reproducible examples, How it works",
- "text": "Reproducible examples, How it works\n\nOpen a new .R script.\nPaste your buggy code in the file (no need to save)\nEdit your code to make sure it’s “enough to produce the error” and nothing more. (By rerunning the code a few times.)\nCopy your code.\nCall reprex::reprex(venue = \"r\") from the console. This will run your code in a new environment and show the result in the Viewer tab. Does it create the error you expect?\nIf it creates other errors, that may be the problem. You may fix the bug on your own!\nIf it doesn’t have errors, then your global environment is Farblunget.\nThe Output is now on your clipboard. Go to Slack and paste it in a message. Then press Cmd+Shift+Enter (on Mac) or Ctrl+Shift+Enter (Windows/Linux). Under Type, select R.\nSend the message, perhaps with more description and an SOS emoji.\n\n\n\n\n\n\n\nNote\n\n\nBecause Reprex runs in it’s own environment, it doesn’t have access to any of the libraries you loaded or the stuff in your global environment. You’ll have to load these things in the script."
+ "section": "Aside on “Built-in” & “Command line”",
+ "text": "Aside on “Built-in” & “Command line”\n\n\n\n\n\n\nTip\n\n\nFirst things first, RStudio and the Terminal\n\n\n\n\nCommand line is the “old” type of computing. You type commands at a prompt and the computer “does stuff”.\nYou may not have seen where this is. RStudio has one built in called “Terminal”\nThe Mac System version is also called “Terminal”. If you have a Linux machine, this should all be familiar.\nWindows is not great at this.\nTo get the most out of Git, you have to use the command line."
},
{
- "objectID": "schedule/slides/00-r-review.html#tidyverse-is-huge",
- "href": "schedule/slides/00-r-review.html#tidyverse-is-huge",
+ "objectID": "schedule/slides/00-version-control.html#typical-workflow",
+ "href": "schedule/slides/00-version-control.html#typical-workflow",
"title": "UBC Stat406 2023W",
- "section": "{tidyverse} is huge",
- "text": "{tidyverse} is huge\nCore tidyverse is nearly 30 different R packages, but we’re going to just talk about a few of them.\nFalls roughly into a few categories:\n\nConvenience functions: {magrittr} and many many others.\nData processing: {dplyr} and many others.\nGraphing: {ggplot2} and some others like {scales}.\nUtilities\n\n\n\nWe’re going to talk quickly about some of it, but ignore much of 2.\nThere’s a lot that’s great about these packages, especially ease of data processing.\nBut it doesn’t always jive with base R (it’s almost a separate proglang at this point)."
+ "section": "Typical workflow",
+ "text": "Typical workflow\n\nDownload a repo from Github\n\ngit clone https://github.com/stat550-2021/lecture-slides.git\n\nCreate a branch\n\ngit branch <branchname>\n\nMake changes to your files.\nAdd your changes to be tracked (“stage” them)\n\ngit add <name/of/tracked/file>\n\nCommit your changes\n\ngit commit -m \"Some explanatory message\"\nRepeat 3–5 as needed. Once you’re satisfied\n\nPush to GitHub\n\ngit push\ngit push -u origin <branchname>"
},
{
- "objectID": "schedule/slides/00-r-review.html#piping",
- "href": "schedule/slides/00-r-review.html#piping",
+ "objectID": "schedule/slides/00-version-control.html#what-should-be-tracked",
+ "href": "schedule/slides/00-version-control.html#what-should-be-tracked",
"title": "UBC Stat406 2023W",
- "section": "Piping",
- "text": "Piping\nThis was introduced by {magrittr} as %>%,\nbut is now in base R (>=4.1.0) as |>.\nNote: there are other pipes in {magrittr} (e.g. %$% and %T%) but I’ve never used them.\nI’ve used the old version for so long, that it’s hard for me to adopt the new one.\nThe point of the pipe is to logically sequence nested operations"
+ "section": "What should be tracked?",
+ "text": "What should be tracked?\n\n\nDefinitely\n\ncode, markdown documentation, tex files, bash scripts/makefiles, …\n\n\n\n\nPossibly\n\nlogs, jupyter notebooks, images (that won’t change), …\n\n\n\n\nQuestionable\n\nprocessed data, static pdfs, …\n\n\n\n\nDefinitely not\n\nfull data, continually updated pdfs, other things compiled from source code, …"
},
{
- "objectID": "schedule/slides/00-r-review.html#example",
- "href": "schedule/slides/00-r-review.html#example",
+ "objectID": "schedule/slides/00-version-control.html#what-things-to-track",
+ "href": "schedule/slides/00-version-control.html#what-things-to-track",
"title": "UBC Stat406 2023W",
- "section": "Example",
- "text": "Example\n\n\n\nmse1 <- print(\n sum(\n residuals(\n lm(y~., data = mutate(\n tib, \n x3 = x1^2,\n x4 = log(x2 + abs(min(x2)) + 1)\n )\n )\n )^2\n )\n)\n\n[1] 6.469568e-29\n\n\n\n\nmse2 <- tib |>\n mutate(\n x3 = x1^2, \n x4 = log(x2 + abs(min(x2)) + 1)\n ) %>% # base pipe only goes to first arg\n lm(y ~ ., data = .) |> # note the use of `.`\n residuals() |>\n magrittr::raise_to_power(2) |> # same as `^`(2)\n sum() |>\n print()\n\n[1] 6.469568e-29"
+ "section": "What things to track",
+ "text": "What things to track\n\nYou decide what is “versioned”.\nA file called .gitignore tells git files or types to never track\n\n# History files\n.Rhistory\n.Rapp.history\n\n# Session Data files\n.RData\n\n# User-specific files\n.Ruserdata\n\n# Compiled junk\n*.o\n*.so\n*.DS_Store\n\nShortcut to track everything (use carefully):\n\ngit add ."
},
{
- "objectID": "schedule/slides/00-r-review.html#section-4",
- "href": "schedule/slides/00-r-review.html#section-4",
+ "objectID": "schedule/slides/00-version-control.html#rules",
+ "href": "schedule/slides/00-version-control.html#rules",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "It may seem like we should push this all the way\n\ntib |>\n mutate(\n x3 = x1^2, \n x4 = log(x2 + abs(min(x2)) + 1)\n ) %>% # base pipe only goes to first arg\n lm(y ~ ., data = .) |> # note the use of `.`\n residuals() |>\n magrittr::raise_to_power(2) |> # same as `^`(2)\n sum() ->\n mse3\n\nThis works, but it’s really annoying."
+ "section": "Rules",
+ "text": "Rules\nHomework and Labs\n\nYou each have your own repo\nYou make a branch\nDO NOT rename files\nMake enough commits (3 for labs, 5 for HW).\nPush your changes (at anytime) and make a PR against main when done.\nTAs review your work.\nOn HW, if you want to revise, make changes in response to feedback and push to the same branch. Then “re-request review”."
},
{
- "objectID": "schedule/slides/00-r-review.html#a-new-one",
- "href": "schedule/slides/00-r-review.html#a-new-one",
+ "objectID": "schedule/slides/00-version-control.html#whats-a-pr",
+ "href": "schedule/slides/00-version-control.html#whats-a-pr",
"title": "UBC Stat406 2023W",
- "section": "A new one…",
- "text": "A new one…\nJust last week, I learned\n\nlibrary(magrittr)\ntib <- tibble(x = 1:5, z = 6:10)\ntib <- tib |> mutate(b = x + z)\ntib\n\n# A tibble: 5 × 3\n x z b\n <int> <int> <int>\n1 1 6 7\n2 2 7 9\n3 3 8 11\n4 4 9 13\n5 5 10 15\n\n# start over\ntib <- tibble(x = 1:5, z = 6:10)\ntib %<>% mutate(b = x + z)\ntib\n\n# A tibble: 5 × 3\n x z b\n <int> <int> <int>\n1 1 6 7\n2 2 7 9\n3 3 8 11\n4 4 9 13\n5 5 10 15"
+ "section": "What’s a PR?",
+ "text": "What’s a PR?\n\nThis exists on GitHub (not git)\nDemonstration"
},
{
- "objectID": "schedule/slides/00-r-review.html#data-processing-in-dplyr",
- "href": "schedule/slides/00-r-review.html#data-processing-in-dplyr",
+ "objectID": "schedule/slides/00-version-control.html#whats-a-pr-1",
+ "href": "schedule/slides/00-version-control.html#whats-a-pr-1",
"title": "UBC Stat406 2023W",
- "section": "Data processing in {dplyr}",
- "text": "Data processing in {dplyr}\nThis package has all sorts of things. And it interacts with {tibble} generally.\nThe basic idea is “tibble in, tibble out”.\nSatisfies data masking which means you can refer to columns by name or use helpers like ends_with(\"_rate\")\nMajorly useful operations:\n\nselect() (chooses columns to keep)\nmutate() (showed this already)\ngroup_by()\npivot_longer() and pivot_wider()\nleft_join() and full_join()\nsummarise()\n\n\n\n\n\n\n\nNote\n\n\nfilter() and select() are functions in Base R.\nSometimes you get 🐞 because it called the wrong version.\nTo be sure, prefix it like dplyr::select()."
+ "section": "What’s a PR?",
+ "text": "What’s a PR?\n\nThis exists on GitHub (not git)\nDemonstration"
},
{
- "objectID": "schedule/slides/00-r-review.html#a-useful-data-frame",
- "href": "schedule/slides/00-r-review.html#a-useful-data-frame",
+ "objectID": "schedule/slides/00-version-control.html#some-things-to-be-aware-of",
+ "href": "schedule/slides/00-version-control.html#some-things-to-be-aware-of",
"title": "UBC Stat406 2023W",
- "section": "A useful data frame",
- "text": "A useful data frame\n\nlibrary(epidatr)\ncovid <- covidcast(\n source = \"jhu-csse\",\n signals = \"confirmed_7dav_incidence_prop,deaths_7dav_incidence_prop\",\n time_type = \"day\",\n geo_type = \"state\",\n time_values = epirange(20220801, 20220821),\n geo_values = \"ca,wa\") |>\n fetch() |>\n select(geo_value, time_value, signal, value)\n\ncovid\n\n# A tibble: 84 × 4\n geo_value time_value signal value\n <chr> <date> <chr> <dbl>\n 1 ca 2022-08-01 confirmed_7dav_incidence_prop 45.4\n 2 wa 2022-08-01 confirmed_7dav_incidence_prop 27.7\n 3 ca 2022-08-02 confirmed_7dav_incidence_prop 44.9\n 4 wa 2022-08-02 confirmed_7dav_incidence_prop 27.7\n 5 ca 2022-08-03 confirmed_7dav_incidence_prop 44.5\n 6 wa 2022-08-03 confirmed_7dav_incidence_prop 26.6\n 7 ca 2022-08-04 confirmed_7dav_incidence_prop 42.3\n 8 wa 2022-08-04 confirmed_7dav_incidence_prop 26.6\n 9 ca 2022-08-05 confirmed_7dav_incidence_prop 40.7\n10 wa 2022-08-05 confirmed_7dav_incidence_prop 34.6\n# ℹ 74 more rows"
+ "section": "Some things to be aware of",
+ "text": "Some things to be aware of\n\nmaster vs main\nIf you think you did something wrong, stop and ask for help\nThere are guardrails in place. But those won’t stop a bulldozer.\nThe hardest part is the initial setup. Then, this should all be rinse-and-repeat.\nThis book is great: Happy Git with R\n\nSee Chapter 6 if you have install problems.\nSee Chapter 9 for credential caching (avoid typing a password all the time)\nSee Chapter 13 if RStudio can’t find git"
},
{
- "objectID": "schedule/slides/00-r-review.html#examples",
- "href": "schedule/slides/00-r-review.html#examples",
+ "objectID": "schedule/slides/00-version-control.html#the-maindevelopbranch-workflow",
+ "href": "schedule/slides/00-version-control.html#the-maindevelopbranch-workflow",
"title": "UBC Stat406 2023W",
- "section": "Examples",
- "text": "Examples\nRename the signal to something short.\n\ncovid <- covid |> \n mutate(signal = case_when(\n str_starts(signal, \"confirmed\") ~ \"case_rate\", \n TRUE ~ \"death_rate\"\n ))\n\nSort by time_value then geo_value\n\ncovid <- covid |> arrange(time_value, geo_value)\n\nCalculate grouped medians\n\ncovid |> \n group_by(geo_value, signal) |>\n summarise(med = median(value), .groups = \"drop\")\n\n# A tibble: 4 × 3\n geo_value signal med\n <chr> <chr> <dbl>\n1 ca case_rate 33.2 \n2 ca death_rate 0.112\n3 wa case_rate 23.2 \n4 wa death_rate 0.178"
+ "section": "The main/develop/branch workflow",
+ "text": "The main/develop/branch workflow\n\nWhen working on your own\n\nDon’t NEED branches (but you should use them, really)\nI make a branch if I want to try a modification without breaking what I have.\n\nWhen working on a large team with production grade software\n\nmain is protected, released version of software (maybe renamed to release)\ndevelop contains things not yet on main, but thoroughly tested\nOn a schedule (once a week, once a month) develop gets merged to main\nYou work on a feature branch off develop to build your new feature\nYou do a PR against develop. Supervisors review your contributions\n\n\n\nI and many DS/CS/Stat faculty use this workflow with my lab."
},
{
- "objectID": "schedule/slides/00-r-review.html#examples-1",
- "href": "schedule/slides/00-r-review.html#examples-1",
+ "objectID": "schedule/slides/00-version-control.html#protection",
+ "href": "schedule/slides/00-version-control.html#protection",
"title": "UBC Stat406 2023W",
- "section": "Examples",
- "text": "Examples\nSplit the data into two tibbles by signal\n\ncases <- covid |> \n filter(signal == \"case_rate\") |>\n rename(case_rate = value) |> select(-signal)\ndeaths <- covid |> \n filter(signal == \"death_rate\") |>\n rename(death_rate = value) |> select(-signal)\n\nJoin them together\n\njoined <- full_join(cases, deaths, by = c(\"geo_value\", \"time_value\"))\n\nDo the same thing by pivoting\n\ncovid |> pivot_wider(names_from = signal, values_from = value)\n\n# A tibble: 42 × 4\n geo_value time_value case_rate death_rate\n <chr> <date> <dbl> <dbl>\n 1 ca 2022-08-01 45.4 0.105\n 2 wa 2022-08-01 27.7 0.169\n 3 ca 2022-08-02 44.9 0.106\n 4 wa 2022-08-02 27.7 0.169\n 5 ca 2022-08-03 44.5 0.107\n 6 wa 2022-08-03 26.6 0.173\n 7 ca 2022-08-04 42.3 0.112\n 8 wa 2022-08-04 26.6 0.173\n 9 ca 2022-08-05 40.7 0.116\n10 wa 2022-08-05 34.6 0.225\n# ℹ 32 more rows"
+ "section": "Protection",
+ "text": "Protection\n\nTypical for your PR to trigger tests to make sure you don’t break things\nTypical for team members or supervisors to review your PR for compliance"
},
{
- "objectID": "schedule/slides/00-r-review.html#plotting-with-ggplot2",
- "href": "schedule/slides/00-r-review.html#plotting-with-ggplot2",
+ "objectID": "schedule/slides/00-version-control.html#guardrails",
+ "href": "schedule/slides/00-version-control.html#guardrails",
"title": "UBC Stat406 2023W",
- "section": "Plotting with {ggplot2}",
- "text": "Plotting with {ggplot2}\n\nEverything you can do with ggplot(), you can do with plot(). But the defaults are much prettier.\nIt’s also much easier to adjust by aesthetics / panels by factors.\nIt also uses “data masking”: data goes into ggplot(data = mydata), then the columns are available to the rest.\nIt (sort of) pipes, but by adding layers with +\nIt strongly prefers “long” data frames over “wide” data frames.\n\n\nI’ll give a very fast overview of some confusing bits."
+ "section": "Guardrails",
+ "text": "Guardrails\n\nThe .github directory contains interactions with GitHub\n\nActions: On push / PR / other GitHub does something on their server (builds a website, runs tests on code)\nPR templates: Little admonitions when you open a PR\nBranch protection: prevent you from doing stuff\n\nIn this course, I protect main so that you can’t push there\n\n\n\n\n\n\n\nWarning\n\n\nIf you try to push to main, it will give an error like\nremote: error: GH006: Protected branch update failed for refs/heads/main.\nThe fix is: make a new branch, then push that."
},
{
- "objectID": "schedule/slides/01-lm-review.html#meta-lecture",
- "href": "schedule/slides/01-lm-review.html#meta-lecture",
+ "objectID": "schedule/slides/00-version-control.html#operations-in-rstudio",
+ "href": "schedule/slides/00-version-control.html#operations-in-rstudio",
"title": "UBC Stat406 2023W",
- "section": "01 Linear model review",
- "text": "01 Linear model review\nStat 406\nDaniel J. McDonald\nLast modified – 30 August 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\]"
+ "section": "Operations in Rstudio",
+ "text": "Operations in Rstudio\n\n\n\nStage\nCommit\nPush\nPull\nCreate a branch\n\nCovers:\n\nEverything to do your HW / Project if you’re careful\nPlus most other things you “want to do”\n\n\n\nCommand line versions (of the same)\ngit add <name/of/file>\n\ngit commit -m \"some useful message\"\n\ngit push\n\ngit pull\n\ngit checkout -b <name/of/branch>"
},
{
- "objectID": "schedule/slides/01-lm-review.html#the-normal-linear-model",
- "href": "schedule/slides/01-lm-review.html#the-normal-linear-model",
+ "objectID": "schedule/slides/00-version-control.html#other-useful-stuff-but-command-line-only",
+ "href": "schedule/slides/00-version-control.html#other-useful-stuff-but-command-line-only",
"title": "UBC Stat406 2023W",
- "section": "The normal linear model",
- "text": "The normal linear model\nAssume that\n\\[\ny_i = x_i^\\top \\beta + \\epsilon_i.\n\\]\n\n\nWhat is the mean of \\(y_i\\)?\nWhat is the distribution of \\(\\epsilon_i\\)?\nWhat is the notation \\(\\mathbf{X}\\) or \\(\\mathbf{y}\\)?"
+ "section": "Other useful stuff (but command line only)",
+ "text": "Other useful stuff (but command line only)\n\n\nInitializing\ngit config user.name --global \"Daniel J. McDonald\"\ngit config user.email --global \"daniel@stat.ubc.ca\"\ngit config core.editor --global nano \n# or emacs or ... (default is vim)\nStaging\ngit add name/of/file # stage 1 file\ngit add . # stage all\nCommitting\n# stage/commit simultaneously\ngit commit -am \"message\" \n\n# open editor to write long commit message\ngit commit \nPushing\n# If branchname doesn't exist\n# on remote, create it and push\ngit push -u origin branchname\n\n\nBranching\n# switch to branchname, error if uncommitted changes\ngit checkout branchname \n# switch to a previous commit\ngit checkout aec356\n\n# create a new branch\ngit branch newbranchname\n# create a new branch and check it out\ngit checkout -b newbranchname\n\n# merge changes in branch2 onto branch1\ngit checkout branch1\ngit merge branch2\n\n# grab a file from branch2 and put it on current\ngit checkout branch2 -- name/of/file\n\ngit branch -v # list all branches\nCheck the status\ngit status\ngit remote -v # list remotes\ngit log # show recent commits, msgs"
},
{
- "objectID": "schedule/slides/01-lm-review.html#drawing-a-sample",
- "href": "schedule/slides/01-lm-review.html#drawing-a-sample",
+ "objectID": "schedule/slides/00-version-control.html#conflicts",
+ "href": "schedule/slides/00-version-control.html#conflicts",
"title": "UBC Stat406 2023W",
- "section": "Drawing a sample",
- "text": "Drawing a sample\n\\[\ny_i = x_i^\\top \\beta + \\epsilon_i.\n\\]\nHow would I create data from this model (draw a sample)?\n\nSet up parameters\n\np <- 3\nn <- 100\nsigma <- 2\n\n\n\nCreate the data\n\nepsilon <- rnorm(n, sd = sigma) # this is random\nX <- matrix(runif(n * p), n, p) # treat this as fixed, but I need numbers\nbeta <- (p + 1):1 # parameter, also fixed, but I again need numbers\nY <- cbind(1, X) %*% beta + epsilon # epsilon is random, so this is\n## Equiv: Y <- beta[1] + X %*% beta[-1] + epsilon"
+ "section": "Conflicts",
+ "text": "Conflicts\n\nSometimes you merge things and “conflicts” happen.\nMeaning that changes on one branch would overwrite changes on a different branch.\n\n\n\n\nThey look like this:\n\nHere are lines that are either unchanged from\nthe common ancestor, or cleanly resolved \nbecause only one side changed.\n\nBut below we have some troubles\n<<<<<<< yours:sample.txt\nConflict resolution is hard;\nlet's go shopping.\n=======\nGit makes conflict resolution easy.\n>>>>>>> theirs:sample.txt\n\nAnd here is another line that is cleanly \nresolved or unmodified.\n\n\nYou get to decide, do you want to keep\n\nYour changes (above ======)\nTheir changes (below ======)\nBoth.\nNeither.\n\nBut always delete the <<<<<, ======, and >>>>> lines.\nOnce you’re satisfied, committing resolves the conflict."
},
{
- "objectID": "schedule/slides/01-lm-review.html#how-do-we-estimate-beta",
- "href": "schedule/slides/01-lm-review.html#how-do-we-estimate-beta",
+ "objectID": "schedule/slides/00-version-control.html#some-other-pointers",
+ "href": "schedule/slides/00-version-control.html#some-other-pointers",
"title": "UBC Stat406 2023W",
- "section": "How do we estimate beta?",
- "text": "How do we estimate beta?\n\nGuess.\nOrdinary least squares (OLS).\nMaximum likelihood.\nDo something more creative."
+ "section": "Some other pointers",
+ "text": "Some other pointers\n\nCommits have long names: 32b252c854c45d2f8dfda1076078eae8d5d7c81f\n\nIf you want to use it, you need “enough to be unique”: 32b25\n\nOnline help uses directed graphs in ways different from statistics:\n\nIn stats, arrows point from cause to effect, forward in time\nIn git docs, it’s reversed, they point to the thing on which they depend\n\n\nCheat sheet\nhttps://training.github.com/downloads/github-git-cheat-sheet.pdf"
},
{
- "objectID": "schedule/slides/01-lm-review.html#method-2.-ols",
- "href": "schedule/slides/01-lm-review.html#method-2.-ols",
+ "objectID": "schedule/slides/00-version-control.html#how-to-undo-in-3-scenarios",
+ "href": "schedule/slides/00-version-control.html#how-to-undo-in-3-scenarios",
"title": "UBC Stat406 2023W",
- "section": "Method 2. OLS",
- "text": "Method 2. OLS\nI want to find an estimator \\(\\widehat\\beta\\) that makes small errors on my data.\nI measure errors with the difference between predictions \\(\\mathbf{X}\\widehat\\beta\\) and the responses \\(\\mathbf{y}\\).\n\n\nDon’t care if the differences are positive or negative\n\\[\\sum_{i=1}^n \\left\\lvert y_i - x_i^\\top \\widehat\\beta \\right\\rvert.\\]\nThis is hard to minimize (what is the derivative of \\(|\\cdot|\\)?)\n\\[\\sum_{i=1}^n ( y_i - x_i^\\top \\widehat\\beta )^2.\\]"
+ "section": "How to undo in 3 scenarios",
+ "text": "How to undo in 3 scenarios\n\nSuppose we’re concerned about a file named README.md\nOften, git status will give some of these as suggestions\n\n\n\n1. Saved but not staged\n\nIn RStudio, select the file and click then select Revert…\n\n# grab the previously committed version\ngit checkout -- README.md \n2. Staged but not committed\n\nIn RStudio, uncheck the box by the file, then use the method above.\n\n# unstage\ngit reset HEAD README.md\ngit checkout -- README.md\n\n\n3. Committed\n\nNot easy to do in RStudio…\n\n# check the log to see where you made the chg, \ngit log\n# go one step before that (eg to 32b252)\n# and grab that earlier version\ngit checkout 32b252 -- README.md\n\n# alternatively\n# if it happens to also be on another branch\ngit checkout otherbranch -- README.md"
},
{
- "objectID": "schedule/slides/01-lm-review.html#method-2.-ols-solution",
- "href": "schedule/slides/01-lm-review.html#method-2.-ols-solution",
+ "objectID": "schedule/slides/00-version-control.html#recovering-from-things",
+ "href": "schedule/slides/00-version-control.html#recovering-from-things",
"title": "UBC Stat406 2023W",
- "section": "Method 2. OLS solution",
- "text": "Method 2. OLS solution\nWe write this as\n\\[\\widehat\\beta = \\argmin_\\beta \\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.\\]\n\nFind the \\(\\beta\\) which minimizes the sum of squared errors.\n\n\nNote that this is the same as\n\\[\\widehat\\beta = \\argmin_\\beta \\frac{1}{n}\\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.\\]\n\nFind the beta which minimizes the mean squared error."
+ "section": "Recovering from things",
+ "text": "Recovering from things\n\nAccidentally did work on main, Tried to Push but got refused\n\n# make a new branch with everything, but stay on main\ngit branch newbranch\n# find out where to go to\ngit log\n# undo everything after ace2193\ngit reset --hard ace2193\ngit checkout newbranch\n\nMade a branch, did lots of work, realized it’s trash, and you want to burn it\n\ngit checkout main\ngit branch -d badbranch\n\nAnything more complicated, either post to Slack or LMGTFY\nIn the Lab next week, you’ll practice\n\nDoing it right.\nRecovering from some mistakes."
},
{
- "objectID": "schedule/slides/01-lm-review.html#method-2.-ok-do-it",
- "href": "schedule/slides/01-lm-review.html#method-2.-ok-do-it",
+ "objectID": "schedule/slides/02-lm-example.html#meta-lecture",
+ "href": "schedule/slides/02-lm-example.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Method 2. Ok, do it",
- "text": "Method 2. Ok, do it\nWe differentiate and set to zero\n\\[\\begin{aligned}\n& \\frac{\\partial}{\\partial \\beta} \\frac{1}{n}\\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2\\\\\n&= -\\frac{2}{n}\\sum_{i=1}^n x_i (y_i - x_i^\\top\\beta)\\\\\n&= \\frac{2}{n}\\sum_{i=1}^n x_i x_i^\\top \\beta - x_i y_i\\\\\n0 &\\equiv \\sum_{i=1}^n x_i x_i^\\top \\beta - x_i y_i\\\\\n&\\Rightarrow \\sum_{i=1}^n x_i x_i^\\top \\beta = \\sum_{i=1}^n x_i y_i\\\\\n&\\Rightarrow \\beta = \\left(\\sum_{i=1}^n x_i x_i^\\top\\right)^{-1}\\sum_{i=1}^n x_i y_i\n\\end{aligned}\\]"
+ "section": "02 Linear model example",
+ "text": "02 Linear model example\nStat 406\nDaniel J. McDonald\nLast modified – 06 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\]"
},
{
- "objectID": "schedule/slides/01-lm-review.html#in-matrix-notation",
- "href": "schedule/slides/01-lm-review.html#in-matrix-notation",
+ "objectID": "schedule/slides/02-lm-example.html#economic-mobility",
+ "href": "schedule/slides/02-lm-example.html#economic-mobility",
"title": "UBC Stat406 2023W",
- "section": "In matrix notation…",
- "text": "In matrix notation…\n…this is\n\\[\\hat\\beta = ( \\mathbf{X}^\\top \\mathbf{X})^{-1} \\mathbf{X}^\\top\\mathbf{y}.\\]\nThe \\(\\beta\\) which “minimizes the sum of squared errors”\nAKA, the SSE."
+ "section": "Economic mobility",
+ "text": "Economic mobility\n\ndata(\"mobility\", package = \"Stat406\")\nmobility\n\n# A tibble: 741 × 43\n ID Name Mobility State Population Urban Black Seg_racial Seg_income\n <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>\n 1 100 Johnson Ci… 0.0622 TN 576081 1 0.021 0.09 0.035\n 2 200 Morristown 0.0537 TN 227816 1 0.02 0.093 0.026\n 3 301 Middlesbor… 0.0726 TN 66708 0 0.015 0.064 0.024\n 4 302 Knoxville 0.0563 TN 727600 1 0.056 0.21 0.092\n 5 401 Winston-Sa… 0.0448 NC 493180 1 0.174 0.262 0.072\n 6 402 Martinsvil… 0.0518 VA 92753 0 0.224 0.137 0.024\n 7 500 Greensboro 0.0474 NC 1055133 1 0.218 0.22 0.068\n 8 601 North Wilk… 0.0517 NC 90016 0 0.032 0.114 0.012\n 9 602 Galax 0.0796 VA 64676 0 0.029 0.131 0.005\n10 700 Spartanburg 0.0431 SC 354533 1 0.207 0.139 0.045\n# ℹ 731 more rows\n# ℹ 34 more variables: Seg_poverty <dbl>, Seg_affluence <dbl>, Commute <dbl>,\n# Income <dbl>, Gini <dbl>, Share01 <dbl>, Gini_99 <dbl>, Middle_class <dbl>,\n# Local_tax_rate <dbl>, Local_gov_spending <dbl>, Progressivity <dbl>,\n# EITC <dbl>, School_spending <dbl>, Student_teacher_ratio <dbl>,\n# Test_scores <dbl>, HS_dropout <dbl>, Colleges <dbl>, Tuition <dbl>,\n# Graduation <dbl>, Labor_force_participation <dbl>, Manufacturing <dbl>, …\n\n\n\nNote how many observations and predictors it has.\nWe’ll use Mobility as the response"
},
{
- "objectID": "schedule/slides/01-lm-review.html#method-3-maximum-likelihood",
- "href": "schedule/slides/01-lm-review.html#method-3-maximum-likelihood",
+ "objectID": "schedule/slides/02-lm-example.html#a-linear-model",
+ "href": "schedule/slides/02-lm-example.html#a-linear-model",
"title": "UBC Stat406 2023W",
- "section": "Method 3: maximum likelihood",
- "text": "Method 3: maximum likelihood\nMethod 2 didn’t use anything about the distribution of \\(\\epsilon\\).\nBut if we know that \\(\\epsilon\\) has a normal distribution, we can write down the joint distribution of \\(\\mathbf{y}=(y_1,\\ldots,y_n)^\\top\\):\n\\[\\begin{aligned}\nf_Y(\\mathbf{y} ; \\beta) &= \\prod_{i=1}^n f_{y_i ; \\beta}(y_i)\\\\\n &= \\prod_{i=1}^n \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left(-\\frac{1}{2\\sigma^2} (y_i-x_i^\\top \\beta)^2\\right)\\\\\n &= \\left( \\frac{1}{2\\pi\\sigma^2}\\right)^{n/2} \\exp\\left(-\\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i-x_i^\\top \\beta)^2\\right)\n\\end{aligned}\\]"
+ "section": "A linear model",
+ "text": "A linear model\n\\[\\mbox{Mobility}_i = \\beta_0 + \\beta_1 \\, \\mbox{State}_i + \\beta_2 \\, \\mbox{Urban}_i + \\cdots + \\epsilon_i\\]\nor equivalently\n\\[E \\left[ \\biggl. \\mbox{mobility} \\, \\biggr| \\, \\mbox{State}, \\mbox{Urban},\n \\ldots \\right] = \\beta_0 + \\beta_1 \\, \\mbox{State} +\n \\beta_2 \\, \\mbox{Urban} + \\cdots\\]"
},
{
- "objectID": "schedule/slides/01-lm-review.html#method-3-maximum-likelihood-1",
- "href": "schedule/slides/01-lm-review.html#method-3-maximum-likelihood-1",
+ "objectID": "schedule/slides/02-lm-example.html#analysis",
+ "href": "schedule/slides/02-lm-example.html#analysis",
"title": "UBC Stat406 2023W",
- "section": "Method 3: maximum likelihood",
- "text": "Method 3: maximum likelihood\n\\[\nf_Y(\\mathbf{y} ; \\beta) = \\left( \\frac{1}{2\\pi\\sigma^2}\\right)^{n/2} \\exp\\left(-\\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i-x_i^\\top \\beta)^2\\right)\n\\]\nIn probability courses, we think of \\(f_Y\\) as a function of \\(\\mathbf{y}\\) with \\(\\beta\\) fixed:\n\nIf we integrate over \\(\\mathbf{y}\\), it’s \\(1\\).\nIf we want the probability of \\((a,b)\\), we integrate from \\(a\\) to \\(b\\).\netc."
+ "section": "Analysis",
+ "text": "Analysis\n\nRandomly split into a training (say 3/4) and a test set (1/4)\nUse training set to fit a model\nFit the “full” model\n“Look” at the fit\n\n\n\nset.seed(20220914)\nmob <- mobility[complete.cases(mobility), ]\nn <- nrow(mob)\nmob <- mob |> select(-Name, -ID, -State)\nset <- sample.int(n, floor(n * .75), FALSE)\ntrain <- mob[set, ]\ntest <- mob[setdiff(1:n, set), ]\nfull <- lm(Mobility ~ ., data = train)\n\n\nWhy don’t we include Name or ID?"
},
{
- "objectID": "schedule/slides/01-lm-review.html#turn-it-around",
- "href": "schedule/slides/01-lm-review.html#turn-it-around",
+ "objectID": "schedule/slides/02-lm-example.html#results",
+ "href": "schedule/slides/02-lm-example.html#results",
"title": "UBC Stat406 2023W",
- "section": "Turn it around…",
- "text": "Turn it around…\n…instead, think of it as a function of \\(\\beta\\).\nWe call this “the likelihood” of beta: \\(\\mathcal{L}(\\beta)\\).\nGiven some data, we can evaluate the likelihood for any value of \\(\\beta\\) (assuming \\(\\sigma\\) is known).\nIt won’t integrate to 1 over \\(\\beta\\).\nBut it is “convex”,\nmeaning we can maximize it (the second derivative wrt \\(\\beta\\) is everywhere negative)."
+ "section": "Results",
+ "text": "Results\n\nsummary(full)\n\n\nCall:\nlm(formula = Mobility ~ ., data = train)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.072092 -0.010256 -0.001452 0.009170 0.090428 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 1.849e-01 8.083e-02 2.288 0.022920 * \nPopulation 3.378e-09 2.478e-09 1.363 0.173916 \nUrban 2.853e-03 3.892e-03 0.733 0.464202 \nBlack 7.807e-02 2.859e-02 2.731 0.006735 ** \nSeg_racial -5.626e-02 1.780e-02 -3.160 0.001754 ** \nSeg_income 8.677e-01 9.355e-01 0.928 0.354453 \nSeg_poverty -7.416e-01 5.014e-01 -1.479 0.140316 \nSeg_affluence -2.224e-01 4.763e-01 -0.467 0.640874 \nCommute 6.313e-02 2.838e-02 2.225 0.026915 * \nIncome 4.207e-07 6.997e-07 0.601 0.548112 \nGini 3.592e+00 3.357e+00 1.070 0.285578 \nShare01 -3.635e-02 3.357e-02 -1.083 0.279925 \nGini_99 -3.657e+00 3.356e+00 -1.090 0.276704 \nMiddle_class 1.031e-01 4.835e-02 2.133 0.033828 * \nLocal_tax_rate 2.268e-01 2.620e-01 0.866 0.387487 \nLocal_gov_spending 1.273e-07 3.016e-06 0.042 0.966374 \nProgressivity 4.983e-03 1.324e-03 3.764 0.000205 ***\nEITC -3.324e-04 4.528e-04 -0.734 0.463549 \nSchool_spending -9.019e-04 2.272e-03 -0.397 0.691658 \nStudent_teacher_ratio -1.639e-03 1.123e-03 -1.459 0.145748 \nTest_scores 2.487e-04 3.137e-04 0.793 0.428519 \nHS_dropout -1.698e-01 9.352e-02 -1.816 0.070529 . \nColleges -2.811e-02 7.661e-02 -0.367 0.713942 \nTuition 3.459e-07 4.362e-07 0.793 0.428417 \nGraduation -1.702e-02 1.425e-02 -1.194 0.233650 \nLabor_force_participation -7.850e-02 5.405e-02 -1.452 0.147564 \nManufacturing -1.605e-01 2.816e-02 -5.700 3.1e-08 ***\nChinese_imports -5.165e-04 1.004e-03 -0.514 0.607378 \nTeenage_labor -1.019e+00 2.111e+00 -0.483 0.629639 \nMigration_in 4.490e-02 3.480e-01 0.129 0.897436 \nMigration_out -4.475e-01 4.093e-01 -1.093 0.275224 \nForeign_born 9.137e-02 5.494e-02 1.663 0.097454 . \nSocial_capital -1.114e-03 2.728e-03 -0.408 0.683245 \nReligious 4.570e-02 1.298e-02 3.520 0.000506 ***\nViolent_crime -3.393e+00 1.622e+00 -2.092 0.037373 * \nSingle_mothers -3.590e-01 9.442e-02 -3.802 0.000177 ***\nDivorced 1.707e-02 1.603e-01 0.107 0.915250 \nMarried -5.894e-02 7.246e-02 -0.813 0.416720 \nLongitude -4.239e-05 2.239e-04 -0.189 0.850001 \nLatitude 6.725e-04 5.687e-04 1.182 0.238037 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 0.02128 on 273 degrees of freedom\nMultiple R-squared: 0.7808, Adjusted R-squared: 0.7494 \nF-statistic: 24.93 on 39 and 273 DF, p-value: < 2.2e-16"
},
{
- "objectID": "schedule/slides/01-lm-review.html#so-lets-maximize",
- "href": "schedule/slides/01-lm-review.html#so-lets-maximize",
+ "objectID": "schedule/slides/02-lm-example.html#diagnostic-plots",
+ "href": "schedule/slides/02-lm-example.html#diagnostic-plots",
"title": "UBC Stat406 2023W",
- "section": "So let’s maximize",
- "text": "So let’s maximize\nThe derivative of this thing is kind of ugly.\nBut if we’re trying to maximize over \\(\\beta\\), we can take an increasing transformation without changing anything.\nI choose \\(\\log_e\\).\n\\[\\begin{aligned}\n\\mathcal{L}(\\beta) &= \\left( \\frac{1}{2\\pi\\sigma^2}\\right)^{n/2} \\exp\\left(-\\frac{1}{2\\sigma^2}\\sum_{i=1}^n (y_i-x_i^\\top \\beta)^2\\right)\\\\\n\\ell(\\beta) &=-\\frac{n}{2}\\log (2\\pi\\sigma^2) -\\frac{1}{2\\sigma^2} \\sum_{i=1}^n (y_i-x_i^\\top \\beta)^2\n\\end{aligned}\\]\nBut we can ignore constants, so this gives\n\\[\\widehat\\beta = \\argmax_\\beta -\\sum_{i=1}^n (y_i-x_i^\\top \\beta)^2\\]\nThe same as before!"
+ "section": "Diagnostic plots",
+ "text": "Diagnostic plots\n\n\npar(mar = c(5, 3, 0, 0))\nplot(full, 1)\n\n\n\n\n\n\n\n\n\n\n\n\nplot(full, 2)"
},
{
- "objectID": "schedule/slides/03-regression-function.html#meta-lecture",
- "href": "schedule/slides/03-regression-function.html#meta-lecture",
+ "objectID": "schedule/slides/02-lm-example.html#section",
+ "href": "schedule/slides/02-lm-example.html#section",
"title": "UBC Stat406 2023W",
- "section": "03 The regression function",
- "text": "03 The regression function\nStat 406\nDaniel J. McDonald\nLast modified – 18 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
+ "section": "",
+ "text": "(Those were plot methods for objects of class lm)\nSame thing in ggplot\n\n\nstuff <- tibble(\n residuals = residuals(full), \n fitted = fitted(full),\n stdresiduals = rstandard(full)\n)\nggplot(stuff, aes(fitted, residuals)) +\n geom_point(colour = \"salmon\") +\n geom_smooth(\n se = FALSE, \n colour = \"steelblue\", \n linewidth = 2) +\n ggtitle(\"Residuals vs Fitted\")\n\n\n\n\n\n\n\n\n\n\n\n\nggplot(stuff, aes(sample = stdresiduals)) +\n geom_qq(colour = \"purple\", size = 2) +\n geom_qq_line(colour = \"peachpuff\", linewidth = 2) +\n labs(\n x = \"Theoretical quantiles\", \n y = \"Standardized residuals\",\n title = \"Normal Q-Q\")"
},
{
- "objectID": "schedule/slides/03-regression-function.html#mean-squared-error-mse",
- "href": "schedule/slides/03-regression-function.html#mean-squared-error-mse",
+ "objectID": "schedule/slides/02-lm-example.html#fit-a-reduced-model",
+ "href": "schedule/slides/02-lm-example.html#fit-a-reduced-model",
"title": "UBC Stat406 2023W",
- "section": "Mean squared error (MSE)",
- "text": "Mean squared error (MSE)\nLast time… Ordinary Least Squares\n\\[\\widehat\\beta = \\argmin_\\beta \\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.\\]\n“Find the \\(\\beta\\) which minimizes the sum of squared errors.”\n\\[\\widehat\\beta = \\arg\\min_\\beta \\frac{1}{n}\\sum_{i=1}^n ( y_i - x_i^\\top \\beta )^2.\\]\n“Find the beta which minimizes the mean squared error.”"
+ "section": "Fit a reduced model",
+ "text": "Fit a reduced model\n\nreduced <- lm(\n Mobility ~ Commute + Gini_99 + Test_scores + HS_dropout +\n Manufacturing + Migration_in + Religious + Single_mothers, \n data = train)\n\nsummary(reduced)$coefficients |> as_tibble()\n\n# A tibble: 9 × 4\n Estimate `Std. Error` `t value` `Pr(>|t|)`\n <dbl> <dbl> <dbl> <dbl>\n1 0.166 0.0178 9.36 1.83e-18\n2 0.0637 0.0149 4.27 2.62e- 5\n3 -0.109 0.0390 -2.79 5.64e- 3\n4 0.000500 0.000256 1.95 5.19e- 2\n5 -0.216 0.0820 -2.64 8.81e- 3\n6 -0.159 0.0202 -7.89 5.65e-14\n7 -0.389 0.172 -2.26 2.42e- 2\n8 0.0436 0.0105 4.16 4.08e- 5\n9 -0.286 0.0466 -6.15 2.44e- 9\n\nreduced |> broom::glance() |> print(width = 120)\n\n# A tibble: 1 × 12\n r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC\n <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n1 0.718 0.711 0.0229 96.9 5.46e-79 8 743. -1466. -1429.\n deviance df.residual nobs\n <dbl> <int> <int>\n1 0.159 304 313"
},
{
- "objectID": "schedule/slides/03-regression-function.html#forget-all-that",
- "href": "schedule/slides/03-regression-function.html#forget-all-that",
+ "objectID": "schedule/slides/02-lm-example.html#diagnostic-plots-for-reduced-model",
+ "href": "schedule/slides/02-lm-example.html#diagnostic-plots-for-reduced-model",
"title": "UBC Stat406 2023W",
- "section": "Forget all that…",
- "text": "Forget all that…\nThat’s “stuff that seems like a good idea”\nAnd it is for many reasons\nThis class is about those reasons, and the “statistics” behind it\n\n\n\nMethods for “Statistical” Learning\nStarts with “what is a model?”"
+ "section": "Diagnostic plots for reduced model",
+ "text": "Diagnostic plots for reduced model\n\n\nplot(reduced, 1)\n\n\n\n\n\n\n\n\n\n\n\n\nplot(reduced, 2)"
},
{
- "objectID": "schedule/slides/03-regression-function.html#what-is-a-model",
- "href": "schedule/slides/03-regression-function.html#what-is-a-model",
+ "objectID": "schedule/slides/02-lm-example.html#how-do-we-decide-which-model-is-better",
+ "href": "schedule/slides/02-lm-example.html#how-do-we-decide-which-model-is-better",
"title": "UBC Stat406 2023W",
- "section": "What is a model?",
- "text": "What is a model?\nIn statistics, “model” has a mathematical meaning.\nDistinct from “algorithm” or “procedure”.\nDefining a model often leads to a procedure/algorithm with good properties.\nSometimes procedure/algorithm \\(\\Rightarrow\\) a specific model.\n\nStatistics (the field) tells me how to understand when different procedures are desirable and the mathematical guarantees that they satisfy.\n\nWhen are certain models appropriate?\n\nOne definition of “Statistical Learning” is the “statistics behind the procedure”."
+ "section": "How do we decide which model is better?",
+ "text": "How do we decide which model is better?\n\n\n\nGoodness of fit versus prediction power\n\n\nmap( # smaller AIC is better\n list(full = full, reduced = reduced), \n ~ c(aic = AIC(.x), rsq = summary(.x)$r.sq))\n\n$full\n aic rsq \n-1482.5981023 0.7807509 \n\n$reduced\n aic rsq \n-1466.088492 0.718245 \n\n\n\nUse both models to predict Mobility\nCompare both sets of predictions\n\n\n\n\nmses <- function(preds, obs) {\n round(mean((obs - preds)^2), 5)\n}\nc(\n full = mses(\n predict(full, newdata = test), \n test$Mobility),\n reduced = mses(\n predict(reduced, newdata = test), \n test$Mobility)\n)\n\n full reduced \n0.00072 0.00084 \n\n\n\n\nCode\ntest$full <- predict(full, newdata = test)\ntest$reduced <- predict(reduced, newdata = test)\ntest |> \n select(Mobility, full, reduced) |>\n pivot_longer(-Mobility) |>\n ggplot(aes(Mobility, value)) + \n geom_point(color = \"orange\") + \n facet_wrap(~name, 2) +\n xlab('observed mobility') + \n ylab('predicted mobility') +\n geom_abline(slope = 1, intercept = 0, col = \"darkblue\")"
},
{
- "objectID": "schedule/slides/03-regression-function.html#statistical-models-101",
- "href": "schedule/slides/03-regression-function.html#statistical-models-101",
+ "objectID": "schedule/slides/04-bias-variance.html#meta-lecture",
+ "href": "schedule/slides/04-bias-variance.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Statistical models 101",
- "text": "Statistical models 101\nWe observe data \\(Z_1,\\ Z_2,\\ \\ldots,\\ Z_n\\) generated by some probability distribution \\(P\\). We want to use the data to learn about \\(P\\).\n\nA statistical model is a set of distributions \\(\\mathcal{P}\\).\n\nSome examples:\n\n\\(\\P = \\{ 0 < p < 1 : P(z=1)=p,\\ P(z=0)=1-p\\}\\).\n\\(\\P = \\{ \\beta \\in \\R^p,\\ \\sigma>0 : Y \\sim N(X^\\top\\beta,\\sigma^2),\\ X\\mbox{ fixed}\\}\\).\n\\(\\P = \\{\\mbox{all CDF's }F\\}\\).\n\\(\\P = \\{\\mbox{all smooth functions } f: \\R^p \\rightarrow \\R : Z_i = (X_i, Y_i),\\ E[Y_i] = f(X_i) \\}\\)"
+ "section": "04 Bias and variance",
+ "text": "04 Bias and variance\nStat 406\nDaniel J. McDonald\nLast modified – 18 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
},
{
- "objectID": "schedule/slides/03-regression-function.html#statistical-models",
- "href": "schedule/slides/03-regression-function.html#statistical-models",
+ "objectID": "schedule/slides/04-bias-variance.html#section",
+ "href": "schedule/slides/04-bias-variance.html#section",
"title": "UBC Stat406 2023W",
- "section": "Statistical models",
- "text": "Statistical models\nWe want to use the data to select a distribution \\(P\\) that probably generated the data.\n\nMy model:\n\\[\n\\P = \\{ P(z=1)=p,\\ P(z=0)=1-p,\\ 0 < p < 1 \\}\n\\]\n\nTo completely characterize \\(P\\), I just need to estimate \\(p\\).\nNeed to assume that \\(P \\in \\P\\).\nThis assumption is mostly empty: need independent, can’t see \\(z=12\\)."
+ "section": "",
+ "text": "We just talked about\n\nVariance of an estimator.\nIrreducible error when making predictions.\nThese are 2 of the 3 components of the “Prediction Risk” \\(R_n\\)"
},
{
- "objectID": "schedule/slides/03-regression-function.html#statistical-models-1",
- "href": "schedule/slides/03-regression-function.html#statistical-models-1",
+ "objectID": "schedule/slides/04-bias-variance.html#component-3-the-bias",
+ "href": "schedule/slides/04-bias-variance.html#component-3-the-bias",
"title": "UBC Stat406 2023W",
- "section": "Statistical models",
- "text": "Statistical models\nWe observe data \\(Z_i=(Y_i,X_i)\\) generated by some probability distribution \\(P\\). We want to use the data to learn about \\(P\\).\n\nMy model\n\\[\n\\P = \\{ \\beta \\in \\R^p, \\sigma>0 : Y_i \\given X_i=x_i \\sim N(x_i^\\top\\beta,\\ \\sigma^2) \\}.\n\\]\n\nTo completely characterize \\(P\\), I just need to estimate \\(\\beta\\) and \\(\\sigma\\).\nNeed to assume that \\(P\\in\\P\\).\nThis time, I have to assume a lot more: (conditional) Linearity, independence, conditional Gaussian noise, no ignored variables, no collinearity, etc."
+ "section": "Component 3, the Bias",
+ "text": "Component 3, the Bias\nWe need to be specific about what we mean when we say bias.\nBias is neither good nor bad in and of itself.\nA very simple example: let \\(Z_1,\\ \\ldots,\\ Z_n \\sim N(\\mu, 1)\\). - We don’t know \\(\\mu\\), so we try to use the data (the \\(Z_i\\)’s) to estimate it.\n\nI propose 3 estimators:\n\n\\(\\widehat{\\mu}_1 = 12\\),\n\\(\\widehat{\\mu}_2=Z_6\\),\n\\(\\widehat{\\mu}_3=\\overline{Z}\\).\n\nThe bias (by definition) of my estimator is \\(E[\\widehat{\\mu_i}]-\\mu\\).\n\n\nCalculate the bias and variance of each estimator."
},
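The entry above asks the reader to calculate the bias and variance of the three estimators. A minimal simulation sketch (not from the slides; the values mu = 2, n = 20, and the replication count are arbitrary illustration choices) that checks those calculations numerically:

```r
# Monte Carlo check of bias and variance for the three estimators
set.seed(406)
mu <- 2; n <- 20; nrep <- 10000   # illustrative values, not from the slides
ests <- replicate(nrep, {
  Z <- rnorm(n, mean = mu, sd = 1)
  c(mu1 = 12, mu2 = Z[6], mu3 = mean(Z))
})
rowMeans(ests) - mu   # biases: roughly 12 - mu, 0, 0
apply(ests, 1, var)   # variances: roughly 0, 1, 1 / n
```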
{
- "objectID": "schedule/slides/03-regression-function.html#statistical-models-unfamiliar-example",
- "href": "schedule/slides/03-regression-function.html#statistical-models-unfamiliar-example",
+ "objectID": "schedule/slides/04-bias-variance.html#regression-in-general",
+ "href": "schedule/slides/04-bias-variance.html#regression-in-general",
"title": "UBC Stat406 2023W",
- "section": "Statistical models, unfamiliar example",
- "text": "Statistical models, unfamiliar example\nWe observe data \\(Z_i \\in \\R\\) generated by some probability distribution \\(P\\). We want to use the data to learn about \\(P\\).\nMy model\n\\[\n\\P = \\{ Z_i \\textrm{ has a density function } f \\}.\n\\]\n\nTo completely characterize \\(P\\), I need to estimate \\(f\\).\nIn fact, we can’t hope to do this.\n\nRevised Model 1 - \\(\\P=\\{ Z_i \\textrm{ has a density function } f : \\int (f'')^2 dx < M \\}\\)\nRevised Model 2 - \\(\\P=\\{ Z_i \\textrm{ has a density function } f : \\int (f'')^2 dx < K < M \\}\\)\nRevised Model 3 - \\(\\P=\\{ Z_i \\textrm{ has a density function } f : \\int |f'| dx < M \\}\\)\n\nEach of these suggests different ways of estimating \\(f\\)"
+ "section": "Regression in general",
+ "text": "Regression in general\nIf I want to predict \\(Y\\) from \\(X\\), it is almost always the case that\n\\[\n\\mu(x) = \\Expect{Y\\given X=x} \\neq x^{\\top}\\beta\n\\]\nSo the bias of using a linear model is not zero.\n\nWhy? Because\n\\[\n\\Expect{Y\\given X=x}-x^\\top\\beta \\neq \\Expect{Y\\given X=x} - \\mu(x) = 0.\n\\]\nWe can include as many predictors as we like,\nbut this doesn’t change the fact that the world is non-linear."
},
{
- "objectID": "schedule/slides/03-regression-function.html#assumption-lean-regression",
- "href": "schedule/slides/03-regression-function.html#assumption-lean-regression",
+ "objectID": "schedule/slides/04-bias-variance.html#continuation-predicting-new-ys",
+ "href": "schedule/slides/04-bias-variance.html#continuation-predicting-new-ys",
"title": "UBC Stat406 2023W",
- "section": "Assumption Lean Regression",
- "text": "Assumption Lean Regression\nImagine \\(Z = (Y, \\mathbf{X}) \\sim P\\) with \\(Y \\in \\R\\) and \\(\\mathbf{X} = (1, X_1, \\ldots, X_p)^\\top\\).\nWe are interested in the conditional distribution \\(P_{Y|\\mathbf{X}}\\)\nSuppose we think that there is some function of interest which relates \\(Y\\) and \\(X\\).\nLet’s call this function \\(\\mu(\\mathbf{X})\\) for the moment. How do we estimate \\(\\mu\\)? What is \\(\\mu\\)?\n\n\nTo make this precise, we\n\nHave a model \\(\\P\\).\nNeed to define a “good” functional \\(\\mu\\).\nLet’s loosely define “good” as\n\n\nGiven a new (random) \\(Z\\), \\(\\mu(\\mathbf{X})\\) is “close” to \\(Y\\).\n\n\n\nSee Berk et al. Assumption Lean Regression."
+ "section": "(Continuation) Predicting new Y’s",
+ "text": "(Continuation) Predicting new Y’s\nSuppose we want to predict \\(Y\\),\nwe know \\(E[Y]= \\mu \\in \\mathbb{R}\\) and \\(\\textrm{Var}[Y] = 1\\).\nOur data is \\(\\{y_1,\\ldots,y_n\\}\\)\nWe have considered estimating \\(\\mu\\) in various ways, and using \\(\\widehat{Y} = \\widehat{\\mu}\\)\n\n\nLet’s try one more: \\(\\widehat Y_a = a\\overline{Y}_n\\) for some \\(a \\in (0,1]\\)."
},
{
- "objectID": "schedule/slides/03-regression-function.html#evaluating-close",
- "href": "schedule/slides/03-regression-function.html#evaluating-close",
+ "objectID": "schedule/slides/04-bias-variance.html#one-can-show-wait-for-the-proof",
+ "href": "schedule/slides/04-bias-variance.html#one-can-show-wait-for-the-proof",
"title": "UBC Stat406 2023W",
- "section": "Evaluating “close”",
- "text": "Evaluating “close”\nWe need more functions.\nChoose some loss function \\(\\ell\\) that measures how close \\(\\mu\\) and \\(Y\\) are.\n\n\n\nSquared-error:\n\\(\\ell(y,\\ \\mu) = (y-\\mu)^2\\)\nAbsolute-error:\n\\(\\ell(y,\\ \\mu) = |y-\\mu|\\)\nZero-One:\n\\(\\ell(y,\\ \\mu) = I(y\\neq\\mu)=\\begin{cases} 0 & y=\\mu\\\\1 & \\mbox{else}\\end{cases}\\)\nCauchy:\n\\(\\ell(y,\\ \\mu) = \\log(1 + (y - \\mu)^2)\\)\n\n\n\n\n\nCode\nggplot() +\n xlim(-2, 2) +\n geom_function(fun = ~log(1+.x^2), colour = 'purple', linewidth = 2) +\n geom_function(fun = ~.x^2, colour = tertiary, linewidth = 2) +\n geom_function(fun = ~abs(.x), colour = primary, linewidth = 2) +\n geom_line(\n data = tibble(x = seq(-2, 2, length.out = 100), y = as.numeric(x != 0)), \n aes(x, y), colour = orange, linewidth = 2) +\n geom_point(data = tibble(x = 0, y = 0), aes(x, y), \n colour = orange, pch = 16, size = 3) +\n ylab(bquote(\"\\u2113\" * (y - mu))) + xlab(bquote(y - mu))"
+ "section": "One can show… (wait for the proof)",
+ "text": "One can show… (wait for the proof)\n\\(\\widehat Y_a = a\\overline{Y}_n\\) for some \\(a \\in (0,1]\\)\n\\[\nR_n(\\widehat Y_a) = \\Expect{(\\widehat Y_a-Y)^2} = (1 - a)^2\\mu^2 +\n\\frac{a^2}{n} +1\n\\]\n\nWe can minimize this in \\(a\\) to get the best possible prediction risk for an estimator of the form \\(\\widehat Y_a\\):\n\\[\n\\argmin_{a} R_n(\\widehat Y_a) = \\left(\\frac{\\mu^2}{\\mu^2 + 1/n} \\right)\n\\]\n\n\nWhat happens if \\(\\mu \\ll 1\\)?"
},
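A quick numerical sanity check of the risk formula and its minimizer above (a sketch; mu = 1 and n = 5 are assumed to match the values used on the later slides):

```r
# risk of Y_a = a * Ybar_n when sigma^2 = Var[Y] = 1
mu <- 1; n <- 5                                    # assumed values
risk <- function(a) (1 - a)^2 * mu^2 + a^2 / n + 1
optimize(risk, interval = c(0, 1))$minimum         # numerical minimizer, about 0.833
mu^2 / (mu^2 + 1 / n)                              # closed form from the slide: 0.8333...
```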
{
- "objectID": "schedule/slides/03-regression-function.html#start-with-expected-squared-error",
- "href": "schedule/slides/03-regression-function.html#start-with-expected-squared-error",
+ "objectID": "schedule/slides/04-bias-variance.html#section-1",
+ "href": "schedule/slides/04-bias-variance.html#section-1",
"title": "UBC Stat406 2023W",
- "section": "Start with (Expected) Squared Error",
- "text": "Start with (Expected) Squared Error\nLet’s try to minimize the expected squared error (MSE).\nClaim: \\(\\mu(X) = \\Expect{Y\\ \\vert\\ X}\\) minimizes MSE.\nThat is, for any \\(r(X)\\), \\(\\Expect{(Y - \\mu(X))^2} \\leq \\Expect{(Y-r(X))^2}\\).\n\nProof of Claim:\n\\[\\begin{aligned}\n\\Expect{(Y-r(X))^2}\n&= \\Expect{(Y- \\mu(X) + \\mu(X) - r(X))^2}\\\\\n&= \\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} \\\\\n&\\quad +2\\Expect{(Y- \\mu(X))(\\mu(X) - r(X))}\\\\\n&=\\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} \\\\\n&\\quad +2(\\mu(X) - r(X))\\Expect{(Y- \\mu(X))}\\\\\n&=\\Expect{(Y- \\mu(X))^2} + \\Expect{(\\mu(X) - r(X))^2} + 0\\\\\n&\\geq \\Expect{(Y- \\mu(X))^2}\n\\end{aligned}\\]"
+ "section": "",
+ "text": "Important\n\n\n\nWait a minute! I’m saying there is a better estimator than \\(\\overline{Y}_n\\)!"
},
{
- "objectID": "schedule/slides/03-regression-function.html#the-regression-function",
- "href": "schedule/slides/03-regression-function.html#the-regression-function",
+ "objectID": "schedule/slides/04-bias-variance.html#bias-variance-tradeoff-estimating-the-mean",
+ "href": "schedule/slides/04-bias-variance.html#bias-variance-tradeoff-estimating-the-mean",
"title": "UBC Stat406 2023W",
- "section": "The regression function",
- "text": "The regression function\nSometimes people call this solution:\n\\[\\mu(X) = \\Expect{Y \\ \\vert\\ X}\\]\nthe regression function. (But don’t forget that it depended on \\(\\ell\\).)\nIf we assume that \\(\\mu(x) = \\Expect{Y \\ \\vert\\ X=x} = x^\\top \\beta\\), then we get back exactly OLS.\n\nBut why should we assume \\(\\mu(x) = x^\\top \\beta\\)?"
+ "section": "Bias-variance tradeoff: Estimating the mean",
+ "text": "Bias-variance tradeoff: Estimating the mean\n\\[\nR_n(\\widehat Y_a) = (a - 1)^2\\mu^2 + \\frac{a^2}{n} + \\sigma^2\n\\]\n\nmu = 1; n = 5; sig = 1"
},
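The entry above gives only the parameter values (mu = 1, n = 5, sig = 1) for the risk curve; a plotting sketch along these lines (my reconstruction, not the course code) draws R_n(Y_a) as a function of a and marks the minimizer:

```r
# R_n(Y_a) as a function of a, for mu = 1, n = 5, sigma = 1
mu <- 1; n <- 5; sig <- 1
a <- seq(0.01, 1, length.out = 200)
risk <- (a - 1)^2 * mu^2 + a^2 / n + sig^2
plot(a, risk, type = "l", xlab = "a", ylab = expression(R[n]))
abline(v = mu^2 / (mu^2 + 1 / n), lty = 2)  # optimal a, about 0.83
abline(h = 1 + 1 / n, lty = 3)              # risk of a = 1, i.e. plain Ybar
```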
{
- "objectID": "schedule/slides/03-regression-function.html#brief-aside",
- "href": "schedule/slides/03-regression-function.html#brief-aside",
+ "objectID": "schedule/slides/04-bias-variance.html#to-restate",
+ "href": "schedule/slides/04-bias-variance.html#to-restate",
"title": "UBC Stat406 2023W",
- "section": "Brief aside",
- "text": "Brief aside\nSome notation / terminology\n\n“Hats” on things mean “estimates”, so \\(\\widehat{\\mu}\\) is an estimate of \\(\\mu\\)\nParameters are “properties of the model”, so \\(f_X(x)\\) or \\(\\mu\\) or \\(\\Var{Y}\\)\nRandom variables like \\(X\\), \\(Y\\), \\(Z\\) may eventually become data, \\(x\\), \\(y\\), \\(z\\), once observed.\n“Estimating” means “using observations to estimate parameters”\n“Predicting” means “using observations to predict future data”\nOften, there is a parameter whose estimate will provide a prediction.\n\n\n\nThis last point can lead to confusion."
+ "section": "To restate",
+ "text": "To restate\nIf \\(\\mu=\\) 1 and \\(n=\\) 5\nthen it is better to predict with 0.83 \\(\\overline{Y}_5\\)\nthan with \\(\\overline{Y}_5\\) itself.\n\nFor this \\(a =\\) 0.83 and \\(n=5\\)\n\n\\(R_5(\\widehat{Y}_a) =\\) 1.17\n\\(R_5(\\overline{Y}_5)=\\) 1.2"
},
{
- "objectID": "schedule/slides/03-regression-function.html#the-regression-function-1",
- "href": "schedule/slides/03-regression-function.html#the-regression-function-1",
+ "objectID": "schedule/slides/04-bias-variance.html#prediction-risk",
+ "href": "schedule/slides/04-bias-variance.html#prediction-risk",
"title": "UBC Stat406 2023W",
- "section": "The regression function",
- "text": "The regression function\nIn mathematics: \\(\\mu(x) = \\Expect{Y \\ \\vert\\ X=x}\\).\nIn words:\nRegression with squared-error loss is really about estimating the (conditional) mean.\n\nIf \\(Y\\sim \\textrm{N}(\\mu,\\ 1)\\), our best guess for a new \\(Y\\) is \\(\\mu\\).\nFor regression, we let the mean \\((\\mu)\\) depend on \\(X\\).\n\nThink of \\(Y\\sim \\textrm{N}(\\mu(X),\\ 1)\\), then conditional on \\(X=x\\), our best guess for a new \\(Y\\) is \\(\\mu(x)\\)\n\n[whatever this function \\(\\mu\\) is]"
+ "section": "Prediction risk",
+ "text": "Prediction risk\n(Now using generic prediction function \\(f\\))\n\\[\nR_n(f) = \\Expect{(Y - f(X))^2}\n\\]\nWhy should we care about \\(R_n(f)\\)?\n👍 Measures predictive accuracy on average.\n👍 How much confidence should you have in \\(f\\)’s predictions.\n👍 Compare with other predictors: \\(R_n(f)\\) vs \\(R_n(g)\\)\n🤮 This is hard: Don’t know the distribution of the data (if I knew the truth, this would be easy)"
},
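Because the data-generating distribution is unknown, R_n(f) is in practice estimated by averaging squared errors on held-out data, as in the train/test comparison from the 02-lm-example entries. A self-contained sketch with simulated data (every value here is an assumption for illustration):

```r
# estimate R_n(f) with a held-out test set (simulated data, for illustration only)
set.seed(406)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)
train_idx <- sample(200, 150)
f <- lm(y ~ x, data = dat[train_idx, ])        # any fitted predictor
test <- dat[-train_idx, ]
mean((test$y - predict(f, newdata = test))^2)  # test-set estimate of R_n(f)
```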
{
- "objectID": "schedule/slides/03-regression-function.html#anything-strange",
- "href": "schedule/slides/03-regression-function.html#anything-strange",
+ "objectID": "schedule/slides/04-bias-variance.html#bias-variance-decomposition",
+ "href": "schedule/slides/04-bias-variance.html#bias-variance-decomposition",
"title": "UBC Stat406 2023W",
- "section": "Anything strange?",
- "text": "Anything strange?\nFor any two variables \\(Y\\) and \\(X\\), we can always write\n\\[Y = E[Y\\given X] + (Y - E[Y\\given X]) = \\mu(X) + \\eta(X)\\]\nsuch that \\(\\Expect{\\eta(X)}=0\\).\n\n\nSuppose, \\(\\mu(X)=\\mu_0\\) (constant in \\(X\\)), are \\(Y\\) and \\(X\\) independent?\n\n\n\n\nSuppose \\(Y\\) and \\(X\\) are independent, is \\(\\mu(X)=\\mu_0\\)?\n\n\n\n\nFor more practice on this see the Fun Worksheet on Theory and solutions\nIn this course, I do not expect you to be able to create this math, but understanding and explaining it is important."
+ "section": "Bias-variance decomposition",
+ "text": "Bias-variance decomposition\n\\[R_n(\\widehat{Y}_a)=(a - 1)^2\\mu^2 + \\frac{a^2}{n} + 1\\]\n\nprediction risk = \\(\\textrm{bias}^2\\) + variance + irreducible error\nestimation risk = \\(\\textrm{bias}^2\\) + variance\n\nWhat is \\(R_n(\\widehat{Y}_a)\\) for our estimator \\(\\widehat{Y}_a=a\\overline{Y}_n\\)?\n\\[\\begin{aligned}\n\\textrm{bias}(\\widehat{Y}_a) &= \\Expect{a\\overline{Y}_n} - \\mu=(a-1)\\mu\\\\\n\\textrm{var}(\\widehat f(x)) &= \\Expect{ \\left(a\\overline{Y}_n - \\Expect{a\\overline{Y}_n}\\right)^2}\n=a^2\\Expect{\\left(\\overline{Y}_n-\\mu\\right)^2}=\\frac{a^2}{n} \\\\\n\\sigma^2 &= \\Expect{(Y-\\mu)^2}=1\n\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/03-regression-function.html#what-do-we-mean-by-good-predictions",
- "href": "schedule/slides/03-regression-function.html#what-do-we-mean-by-good-predictions",
+ "objectID": "schedule/slides/04-bias-variance.html#this-decomposition-holds-generally",
+ "href": "schedule/slides/04-bias-variance.html#this-decomposition-holds-generally",
"title": "UBC Stat406 2023W",
- "section": "What do we mean by good predictions?",
- "text": "What do we mean by good predictions?\nWe make observations and then attempt to “predict” new, unobserved data.\nSometimes this is the same as estimating the (conditional) mean.\nMostly, we observe \\((y_1,x_1),\\ \\ldots,\\ (y_n,x_n)\\), and we want some way to predict \\(Y\\) from \\(X\\)."
+ "section": "This decomposition holds generally",
+ "text": "This decomposition holds generally\n\\[\\begin{aligned}\nR_n(\\hat{Y})\n&= \\Expect{(Y-\\hat{Y})^2} \\\\\n&= \\Expect{(Y-\\mu + \\mu - \\hat{Y})^2} \\\\\n&= \\Expect{(Y-\\mu)^2} + \\Expect{(\\mu - \\hat{Y})^2} +\n2\\Expect{(Y-\\mu)(\\mu-\\hat{Y})}\\\\\n&= \\Expect{(Y-\\mu)^2} + \\Expect{(\\mu - \\hat{Y})^2} + 0\\\\\n&= \\text{irr. error} + \\text{estimation risk}\\\\\n&= \\sigma^2 + \\Expect{(\\mu - E[\\hat{Y}] + E[\\hat{Y}] - \\hat{Y})^2}\\\\\n&= \\sigma^2 + \\Expect{(\\mu - E[\\hat{Y}])^2} + \\Expect{(E[\\hat{Y}] - \\hat{Y})^2} + 2\\Expect{(\\mu-E[\\hat{Y}])(E[\\hat{Y}] - \\hat{Y})}\\\\\n&= \\sigma^2 + \\Expect{(\\mu - E[\\hat{Y}])^2} + \\Expect{(E[\\hat{Y}] - \\hat{Y})^2} + 0\\\\\n&= \\text{irr. error} + \\text{squared bias} + \\text{variance}\n\\end{aligned}\\]"
},
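A Monte Carlo sketch checking this decomposition for the earlier estimator Y_a = a * Ybar_n (a = 0.83, mu = 1, n = 5 are taken from the surrounding entries; the seed and replication count are arbitrary): the simulated prediction risk should match irreducible error + squared bias + variance.

```r
# verify R_n(Y_a) = sigma^2 + bias^2 + variance for Y_a = a * Ybar_n
set.seed(406)
a <- 0.83; mu <- 1; n <- 5; nrep <- 100000
Ybar <- replicate(nrep, mean(rnorm(n, mean = mu, sd = 1)))   # one Ybar per training sample
Yhat <- a * Ybar
Ynew <- rnorm(nrep, mean = mu, sd = 1)                       # independent new Y's
mean((Ynew - Yhat)^2)               # simulated prediction risk, about 1.17
1 + ((a - 1) * mu)^2 + a^2 / n      # sigma^2 + bias^2 + variance
```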
{
- "objectID": "schedule/slides/03-regression-function.html#expected-test-mse",
- "href": "schedule/slides/03-regression-function.html#expected-test-mse",
+ "objectID": "schedule/slides/04-bias-variance.html#bias-variance-decomposition-1",
+ "href": "schedule/slides/04-bias-variance.html#bias-variance-decomposition-1",
"title": "UBC Stat406 2023W",
- "section": "Expected test MSE",
- "text": "Expected test MSE\nFor regression applications, we will use squared-error loss:\n\\(R_n(\\widehat{\\mu}) = \\Expect{(Y-\\widehat{\\mu}(X))^2}\\)\n\nI’m giving this a name, \\(R_n\\) for ease.\nDifferent than text.\nThis is expected test MSE."
+ "section": "Bias-variance decomposition",
+ "text": "Bias-variance decomposition\n\\[\\begin{aligned}\nR_n(\\hat{Y})\n&= \\Expect{(Y-\\hat{Y})^2} \\\\\n&= \\text{irr. error} + \\text{estimation risk}\\\\\n&= \\text{irr. error} + \\text{squared bias} + \\text{variance}\n\\end{aligned}\\]\n\n\n\n\n\n\nImportant\n\n\n\nImplication: prediction risk is proportional to estimation risk. However, defining estimation risk requires stronger assumptions.\n\n\n\n\n\n\n\n\n\n\n\nTip\n\n\nIn order to make good predictions, we want our prediction risk to be small. This means that we want to “balance” the bias and variance."
},
{
- "objectID": "schedule/slides/03-regression-function.html#example-estimatingpredicting-the-conditional-mean",
- "href": "schedule/slides/03-regression-function.html#example-estimatingpredicting-the-conditional-mean",
+ "objectID": "schedule/slides/04-bias-variance.html#section-2",
+ "href": "schedule/slides/04-bias-variance.html#section-2",
"title": "UBC Stat406 2023W",
- "section": "Example: Estimating/Predicting the (conditional) mean",
- "text": "Example: Estimating/Predicting the (conditional) mean\nSuppose we know that we want to predict a quantity \\(Y\\),\nwhere \\(\\Expect{Y}= \\mu \\in \\mathbb{R}\\) and \\(\\Var{Y} = 1\\).\nOur data is \\(\\{y_1,\\ldots,y_n\\}\\)\nClaim: We want to estimate \\(\\mu\\).\n\nWhy?"
+ "section": "",
+ "text": "Code\ncols = c(blue, red, green, orange)\npar(mfrow = c(2, 2), bty = \"n\", ann = FALSE, xaxt = \"n\", yaxt = \"n\", \n family = \"serif\", mar = c(0, 0, 0, 0), oma = c(0, 2, 2, 0))\nlibrary(mvtnorm)\nmv <- matrix(c(0, 0, 0, 0, -.5, -.5, -.5, -.5), 4, byrow = TRUE)\nva <- matrix(c(.02, .02, .1, .1, .02, .02, .1, .1), 4, byrow = TRUE)\n\nfor (i in 1:4) {\n plot(0, 0, ylim = c(-2, 2), xlim = c(-2, 2), pch = 19, cex = 42, \n col = blue, ann = FALSE, pty = \"s\")\n points(0, 0, pch = 19, cex = 30, col = \"white\")\n points(0, 0, pch = 19, cex = 18, col = green)\n points(0, 0, pch = 19, cex = 6, col = orange)\n points(rmvnorm(20, mean = mv[i, ], sigma = diag(va[i, ])), cex = 1, pch = 19)\n switch(i,\n \"1\" = {\n mtext(\"low variance\", 3, cex = 2)\n mtext(\"low bias\", 2, cex = 2)\n },\n \"2\" = mtext(\"high variance\", 3, cex = 2),\n \"3\" = mtext(\"high bias\", 2, cex = 2)\n )\n}"
},
{
- "objectID": "schedule/slides/03-regression-function.html#estimating-the-mean",
- "href": "schedule/slides/03-regression-function.html#estimating-the-mean",
+ "objectID": "schedule/slides/04-bias-variance.html#bias-variance-tradeoff-overview",
+ "href": "schedule/slides/04-bias-variance.html#bias-variance-tradeoff-overview",
"title": "UBC Stat406 2023W",
- "section": "Estimating the mean",
- "text": "Estimating the mean\n\nLet \\(\\widehat{Y}=\\overline{Y}_n\\) be the sample mean.\n\nWe can ask about the estimation risk (since we’re estimating \\(\\mu\\)):\n\n\n\n\\[\\begin{aligned}\n E[(\\overline{Y}_n-\\mu)^2]\n &= E[\\overline{Y}_n^2]\n -2\\mu E[\\overline{Y}_n] + \\mu^2 \\\\\n &= \\mu^2 + \\frac{1}{n} - 2\\mu^2 +\n \\mu^2\\\\ &= \\frac{1}{n}\n\\end{aligned}\\]\n\n\nUseful trick\nFor any \\(Z\\),\n\\(\\Var{Z} = \\Expect{Z^2} - \\Expect{Z}^2\\).\nTherefore:\n\\(\\Expect{Z^2} = \\Var{Z} + \\Expect{Z}^2\\)."
+ "section": "Bias-variance tradeoff: Overview",
+ "text": "Bias-variance tradeoff: Overview\nbias: how well does \\(\\widehat{f}(x)\\) approximate the truth \\(\\Expect{Y\\given X=x}\\)\n\nIf we allow more complicated possible \\(\\widehat{f}\\), lower bias. Flexibility \\(\\Rightarrow\\) Expressivity\nBut, more flexibility \\(\\Rightarrow\\) larger variance\nComplicated models are hard to estimate precisely for fixed \\(n\\)\nIrreducible error\n\n\n\nSadly, that whole exercise depends on knowing the truth to evaluate \\(E\\ldots\\)"
},
{
- "objectID": "schedule/slides/03-regression-function.html#predicting-new-ys",
- "href": "schedule/slides/03-regression-function.html#predicting-new-ys",
+ "objectID": "schedule/slides/06-information-criteria.html#meta-lecture",
+ "href": "schedule/slides/06-information-criteria.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Predicting new Y’s",
- "text": "Predicting new Y’s\n\nLet \\(\\widehat{Y}=\\overline{Y}_n\\) be the sample mean.\n\nWhat is the prediction risk of \\(\\overline{Y}\\)?\n\n\n\n\\[\\begin{aligned}\n R_n(\\overline{Y}_n)\n &= \\E[(\\overline{Y}_n-Y)^2]\\\\\n &= \\E[\\overline{Y}_{n}^{2}] -2\\E[\\overline{Y}_n Y] + \\E[Y^2] \\\\\n &= \\mu^2 + \\frac{1}{n} - 2\\mu^2 + \\mu^2 + 1 \\\\\n &= 1 + \\frac{1}{n}\n\\end{aligned}\\]\n\n\nTricks:\nUsed the variance thing again.\nIf \\(X\\) and \\(Z\\) are independent, then \\(\\Expect{XZ} = \\Expect{X}\\Expect{Z}\\)"
+ "section": "06 Information criteria",
+ "text": "06 Information criteria\nStat 406\nDaniel J. McDonald\nLast modified – 26 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/03-regression-function.html#predicting-new-ys-1",
- "href": "schedule/slides/03-regression-function.html#predicting-new-ys-1",
+ "objectID": "schedule/slides/06-information-criteria.html#generalized-cv",
+ "href": "schedule/slides/06-information-criteria.html#generalized-cv",
"title": "UBC Stat406 2023W",
- "section": "Predicting new Y’s",
- "text": "Predicting new Y’s\n\nWhat is the prediction risk of guessing \\(Y=0\\)?\nYou can probably guess that this is a stupid idea.\nLet’s show why it’s stupid.\n\n\\[\\begin{aligned}\n R_n(0) &= \\E[(0-Y)^2] = 1 + \\mu^2\n\\end{aligned}\\]"
+ "section": "Generalized CV",
+ "text": "Generalized CV\nLast time we saw a nice trick, that works some of the time (OLS, Ridge regression,…)\n\\[\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(y_i -\\widehat{y}_i)^2}{(1-h_{ii})^2} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-h_{ii})^2}.\\]\n\n\\(\\widehat{\\y} = \\widehat{f}(\\mathbf{X}) = \\mathbf{H}\\mathbf{y}\\) for some matrix \\(\\mathbf{H}\\).\nA technical thing.\n\n\\[\\newcommand{\\H}{\\mathbf{H}}\\]"
},
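The LOO-CV shortcut can be verified directly. A small R sketch (using the built-in mtcars data purely as a stand-in) compares the leverage formula to brute-force refitting:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# shortcut: needs only the residuals and leverages from the single full fit
loo_shortcut <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

# brute force: refit n times, each time leaving one observation out
loo_brute <- mean(sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))^2
}))

c(shortcut = loo_shortcut, brute_force = loo_brute)  # identical up to rounding
```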
{
- "objectID": "schedule/slides/03-regression-function.html#predicting-new-ys-2",
- "href": "schedule/slides/03-regression-function.html#predicting-new-ys-2",
+ "objectID": "schedule/slides/06-information-criteria.html#this-is-another-nice-trick.",
+ "href": "schedule/slides/06-information-criteria.html#this-is-another-nice-trick.",
"title": "UBC Stat406 2023W",
- "section": "Predicting new Y’s",
- "text": "Predicting new Y’s\n\nWhat is the prediction risk of guessing \\(Y=\\mu\\)?\nThis is a great idea, but we don’t know \\(\\mu\\).\nLet’s see what happens anyway.\n\n\\[\\begin{aligned}\n R_n(\\mu) &= \\E[(Y-\\mu)^2]= 1\n\\end{aligned}\\]"
+ "section": "This is another nice trick.",
+ "text": "This is another nice trick.\nIdea: replace \\(h_{ii}\\) with \\(\\frac{1}{n}\\sum_{i=1}^n h_{ii} = \\frac{1}{n}\\textrm{tr}(\\mathbf{H})\\)\nLet’s call \\(\\textrm{tr}(\\mathbf{H})\\) the degrees-of-freedom (or just df)\n\\[\\textrm{GCV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-\\textrm{df}/n)^2} = \\frac{\\textrm{MSE}}{(1-\\textrm{df}/n)^2}\\]\nWhere does this stuff come from?"
},
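A sketch of GCV for an lm() fit, in the spirit of the cv_nice() helper shown on the next slide; gcv_nice() is a hypothetical name, not part of the course code:

```r
# GCV replaces each leverage h_ii by their average, tr(H) / n
gcv_nice <- function(mdl) {
  df <- sum(hatvalues(mdl))       # tr(H); equals p + 1 for lm() with intercept
  n  <- length(residuals(mdl))
  mean(residuals(mdl)^2) / (1 - df / n)^2
}

gcv_nice(lm(mpg ~ wt + hp, data = mtcars))
```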
{
- "objectID": "schedule/slides/03-regression-function.html#risk-relations",
- "href": "schedule/slides/03-regression-function.html#risk-relations",
+ "objectID": "schedule/slides/06-information-criteria.html#what-are-hatvalues",
+ "href": "schedule/slides/06-information-criteria.html#what-are-hatvalues",
"title": "UBC Stat406 2023W",
- "section": "Risk relations",
- "text": "Risk relations\nPrediction risk: \\(R_n(\\overline{Y}_n) = 1 + \\frac{1}{n}\\)\nEstimation risk: \\(E[(\\overline{Y}_n - \\mu)^2] = \\frac{1}{n}\\)\nThere is actually a nice interpretation here:\n\nThe common \\(1/n\\) term is \\(\\Var{\\overline{Y}_n}\\)\n\nThe extra factor of \\(1\\) in the prediction risk is irreducible error\n\n\\(Y\\) is a random variable, and hence noisy.\nWe can never eliminate it’s intrinsic variance.\n\nIn other words, even if we knew \\(\\mu\\), we could never get closer than \\(1\\), on average.\n\n\nIntuitively, \\(\\overline{Y}_n\\) is the obvious thing to do.\nBut what about unintuitive things…"
+ "section": "What are hatvalues?",
+ "text": "What are hatvalues?\n\ncv_nice <- function(mdl) mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)\n\nIn OLS, \\(\\widehat{\\y} = \\X\\widehat{\\beta} = \\X(\\X^\\top \\X)^{-1}\\X^\\top \\y\\)\nWe often call \\(\\mathbf{H} = \\X(\\X^\\top \\X)^{-1}\\X^\\top\\) the Hat matrix, because it puts the hat on \\(\\y\\)\nGCV uses \\(\\textrm{tr}(\\mathbf{H})\\).\nFor lm(), this is just p, the number of predictors (Why?)\nThis is one way of understanding the name degrees-of-freedom"
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#meta-lecture",
- "href": "schedule/slides/05-estimating-test-mse.html#meta-lecture",
+ "objectID": "schedule/slides/06-information-criteria.html#alternative-interpretation",
+ "href": "schedule/slides/06-information-criteria.html#alternative-interpretation",
"title": "UBC Stat406 2023W",
- "section": "05 Estimating test MSE",
- "text": "05 Estimating test MSE\nStat 406\nDaniel J. McDonald\nLast modified – 18 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
+ "section": "Alternative interpretation:",
+ "text": "Alternative interpretation:\nSuppose, \\(Y_i\\) is independent from some distribution with mean \\(\\mu_i\\) and variance \\(\\sigma^2\\)\n(remember: in the linear model \\(\\Expect{Y_i} = x_i^\\top \\beta = \\mu_i\\) )\nLet \\(\\widehat{\\mathbf{Y}}\\) be an estimator of \\(\\mu\\) (all \\(i=1,\\ldots,n\\) elements of the vector).\n\n\\[\\begin{aligned}\n& \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} \\\\\n&= \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-Y_i + Y_i -\\mu_i)^2}\\\\\n&= \\frac{1}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)^2} + \\frac{1}{n}\\Expect{\\sum (Y_i-\\mu_i)^2} + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)}\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} + \\sigma^2 + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)} = \\cdots =\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}\n\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#estimating-prediction-risk",
- "href": "schedule/slides/05-estimating-test-mse.html#estimating-prediction-risk",
+ "objectID": "schedule/slides/06-information-criteria.html#alternative-interpretation-1",
+ "href": "schedule/slides/06-information-criteria.html#alternative-interpretation-1",
"title": "UBC Stat406 2023W",
- "section": "Estimating prediction risk",
- "text": "Estimating prediction risk\nLast time, we saw\n\\(R_n(\\widehat{f}) = E[(Y-\\widehat{f}(X))^2]\\)\nprediction risk = \\(\\textrm{bias}^2\\) + variance + irreducible error\nWe argued that we want procedures that produce \\(\\widehat{f}\\) with small \\(R_n\\).\n\nHow do we estimate \\(R_n\\)?"
+ "section": "Alternative interpretation:",
+ "text": "Alternative interpretation:\n\\[\\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} = \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}\\]\nNow, if \\(\\widehat{\\mathbf{Y}} = \\H \\mathbf{Y}\\) for some matrix \\(\\H\\),\n\\(\\sum\\Cov{Y_i}{\\widehat Y_i} = \\Expect{\\mathbf{Y}^\\top \\H \\mathbf{Y}} = \\sigma^2 \\textrm{tr}(\\H)\\)\nThis gives Mallow’s \\(C_p\\) aka Stein’s Unbiased Risk Estimator:\n\\(MSE + 2\\hat{\\sigma}^2\\textrm{df}/n\\)\n\n\n\n\n\n\nImportant\n\n\nUnfortunately, df may be difficult or impossible to calculate for complicated prediction methods. But one can often estimate it well. This idea is beyond the level of this course."
},
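A sketch of the resulting Cp / SURE estimate for a least-squares fit. Estimating sigma^2 from the residuals of the same fit is one common convention, an assumption here rather than something the slide prescribes:

```r
# Mallows Cp / SURE-style risk estimate: MSE + 2 * sigma^2 * df / n
cp_nice <- function(mdl) {
  n    <- length(residuals(mdl))
  df   <- sum(hatvalues(mdl))                 # tr(H)
  sig2 <- sum(residuals(mdl)^2) / (n - df)    # one common estimate of sigma^2
  mean(residuals(mdl)^2) + 2 * sig2 * df / n
}

cp_nice(lm(mpg ~ wt + hp, data = mtcars))
```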
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#dont-use-training-error",
- "href": "schedule/slides/05-estimating-test-mse.html#dont-use-training-error",
+ "objectID": "schedule/slides/06-information-criteria.html#aic-and-bic",
+ "href": "schedule/slides/06-information-criteria.html#aic-and-bic",
"title": "UBC Stat406 2023W",
- "section": "Don’t use training error",
- "text": "Don’t use training error\nThe training error in regression is\n\\[\\widehat{R}_n(\\widehat{f}) = \\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{f}(x_i))^2\\]\nHere, the \\(n\\) is doubly used (annoying, but simple): \\(n\\) observations to create \\(\\widehat{f}\\) and \\(n\\) terms in the sum.\n\n\n\n\n\n\nImportant\n\n\nTraining error is a bad estimator for \\(R_n(\\widehat{f})\\).\n\n\n\nSo we should never use it."
+ "section": "AIC and BIC",
+ "text": "AIC and BIC\nThese have a very similar flavor to \\(C_p\\), but their genesis is different.\nWithout going into too much detail, they look like\n\\(\\textrm{AIC}/n = -2\\textrm{loglikelihood}/n + 2\\textrm{df}/n\\)\n\\(\\textrm{BIC}/n = -2\\textrm{loglikelihood}/n + 2\\log(n)\\textrm{df}/n\\)\n\nIn the case of a linear model with Gaussian errors and \\(p\\) predictors\n\\[\\begin{aligned}\n\\textrm{AIC}/n &= \\log(2\\pi) + \\log(RSS/n) + 2(p+1)/n \\\\\n&\\propto \\log(RSS) + 2(p+1)/n\n\\end{aligned}\\]\n( \\(p+1\\) because of the unknown variance, intercept included in \\(p\\) or not)\n\n\n\n\n\n\n\n\nImportant\n\n\nUnfortunately, different books/software/notes define these differently. Even different R packages. This is super annoying.\nForms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here."
},
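A sketch relating the generic forms to R's built-in AIC() and BIC(). These use R's conventions (no division by n, additive constants kept), so the absolute numbers differ from the slide's expressions; differences between two models computed with one convention are what matter:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- as.numeric(logLik(fit))   # Gaussian log-likelihood of the fit
k  <- attr(logLik(fit), "df")   # number of parameters (coefficients + sigma^2)
n  <- nobs(fit)

c(by_hand = -2 * ll + 2 * k,      built_in = AIC(fit))
c(by_hand = -2 * ll + log(n) * k, built_in = BIC(fit))
```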
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#these-all-have-the-same-r2-and-training-error",
- "href": "schedule/slides/05-estimating-test-mse.html#these-all-have-the-same-r2-and-training-error",
+ "objectID": "schedule/slides/06-information-criteria.html#over-fitting-vs.-under-fitting",
+ "href": "schedule/slides/06-information-criteria.html#over-fitting-vs.-under-fitting",
"title": "UBC Stat406 2023W",
- "section": "These all have the same \\(R^2\\) and Training Error",
- "text": "These all have the same \\(R^2\\) and Training Error\n\n\n\n\nCode\nans <- anscombe |>\n pivot_longer(everything(), names_to = c(\".value\", \"set\"), \n names_pattern = \"(.)(.)\")\nggplot(ans, aes(x, y)) + \n geom_point(colour = orange, size = 3) + \n geom_smooth(method = \"lm\", se = FALSE, color = blue, linewidth = 2) +\n facet_wrap(~set, labeller = label_both)\n\n\n\n\n\n\n\n\n\n\n\n\nans %>% \n group_by(set) |> \n summarise(\n R2 = summary(lm(y ~ x))$r.sq, \n train_error = mean((y - predict(lm(y ~ x)))^2)\n ) |>\n kableExtra::kable(digits = 2)\n\n\n\n\nset\nR2\ntrain_error\n\n\n\n\n1\n0.67\n1.25\n\n\n2\n0.67\n1.25\n\n\n3\n0.67\n1.25\n\n\n4\n0.67\n1.25"
+ "section": "Over-fitting vs. Under-fitting",
+ "text": "Over-fitting vs. Under-fitting\n\nOver-fitting means estimating a really complicated function when you don’t have enough data.\n\nThis is likely a low-bias / high-variance situation.\n\nUnder-fitting means estimating a really simple function when you have lots of data.\n\nThis is likely a high-bias / low-variance situation.\nBoth of these outcomes are bad (they have high risk \\(=\\) big \\(R_n\\) ).\nThe best way to avoid them is to use a reasonable estimate of prediction risk to choose how complicated your model should be."
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#adding-junk-predictors-increases-r2-and-decreases-training-error",
- "href": "schedule/slides/05-estimating-test-mse.html#adding-junk-predictors-increases-r2-and-decreases-training-error",
+ "objectID": "schedule/slides/06-information-criteria.html#recommendations",
+ "href": "schedule/slides/06-information-criteria.html#recommendations",
"title": "UBC Stat406 2023W",
- "section": "Adding “junk” predictors increases \\(R^2\\) and decreases Training Error",
- "text": "Adding “junk” predictors increases \\(R^2\\) and decreases Training Error\n\nn <- 100\np <- 10\nq <- 0:30\nx <- matrix(rnorm(n * (p + max(q))), nrow = n)\ny <- x[, 1:p] %*% c(5:1, 1:5) + rnorm(n, 0, 10)\n\nregress_on_junk <- function(q) {\n x <- x[, 1:(p + q)]\n mod <- lm(y ~ x)\n tibble(R2 = summary(mod)$r.sq, train_error = mean((y - predict(mod))^2))\n}\n\n\n\nCode\nmap(q, regress_on_junk) |> \n list_rbind() |>\n mutate(q = q) |>\n pivot_longer(-q) |>\n ggplot(aes(q, value, colour = name)) +\n geom_line(linewidth = 2) + xlab(\"train_error\") +\n scale_colour_manual(values = c(blue, orange), guide = \"none\") +\n facet_wrap(~ name, scales = \"free_y\")"
+ "section": "Recommendations",
+ "text": "Recommendations\n\nWhen comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV.\nCV is usually easiest to make sense of and doesn’t depend on other unknown parameters.\nBut, it requires refitting the model.\nAlso, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc."
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#other-things-you-cant-use",
- "href": "schedule/slides/05-estimating-test-mse.html#other-things-you-cant-use",
+ "objectID": "schedule/slides/06-information-criteria.html#high-level-intuition-of-these",
+ "href": "schedule/slides/06-information-criteria.html#high-level-intuition-of-these",
"title": "UBC Stat406 2023W",
- "section": "Other things you can’t use",
- "text": "Other things you can’t use\nYou should not use anova\nor the \\(p\\)-values from the lm output for this purpose.\n\nThese things are to determine whether those parameters are different from zero if you were to repeat the experiment many times, if the model were true, etc. etc.\n\nNot the same as “are they useful for prediction = do they help me get smaller \\(R_n\\)?”"
+ "section": "High-level intuition of these:",
+ "text": "High-level intuition of these:\n\nGCV tends to choose “dense” models.\nTheory says AIC chooses the “best predicting model” asymptotically.\nTheory says BIC should choose the “true model” asymptotically, tends to select fewer predictors.\nIn some special cases, AIC = Cp = SURE \\(\\approx\\) LOO-CV\nAs a technical point, CV (or validation set) is estimating error on new data, unseen \\((X_0, Y_0)\\), while AIC / CP are estimating error on new Y at the observed \\(x_1,\\ldots,x_n\\). This is subtle.\n\n\n\nFor more information: see [ESL] Chapter 7. This material is more challenging than the level of this course, and is easily and often misunderstood."
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#risk-of-risk",
- "href": "schedule/slides/05-estimating-test-mse.html#risk-of-risk",
+ "objectID": "schedule/slides/06-information-criteria.html#a-few-more-caveats",
+ "href": "schedule/slides/06-information-criteria.html#a-few-more-caveats",
"title": "UBC Stat406 2023W",
- "section": "Risk of Risk",
- "text": "Risk of Risk\nWhile it’s crummy, Training Error is an estimator of \\(R_n(\\hat{f})\\)\nRecall, \\(R_n(\\hat{f})\\) is a parameter (a property of the data distribution)\nSo we can ask “is \\(\\widehat{R}(\\hat{f})\\) a good estimator for \\(R_n(\\hat{f})\\)?”\nBoth are just numbers, so perhaps a good way to measure is\n\\[\nE[(R_n - \\widehat{R})^2]\n= \\cdots\n= (R_n - E[\\widehat{R}])^2 + \\Var{\\widehat{R}}\n\\]\nChoices you make determine how good this is.\nWe can try to balance it’s bias and variance…"
+ "section": "A few more caveats",
+ "text": "A few more caveats\nIt is often tempting to “just compare” risk estimates from vastly different models.\nFor example,\n\ndifferent transformations of the predictors,\ndifferent transformations of the response,\nPoisson likelihood vs. Gaussian likelihood in glm()\n\nThis is not always justified.\n\nThe “high-level intuition” is for “nested” models.\nDifferent likelihoods aren’t comparable.\nResiduals / response variables on different scales aren’t directly comparable.\n\n“Validation set” is easy, because you’re always comparing to the “right” thing. But it has lots of drawbacks."
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#held-out-sets",
- "href": "schedule/slides/05-estimating-test-mse.html#held-out-sets",
+ "objectID": "schedule/slides/08-ridge-regression.html#meta-lecture",
+ "href": "schedule/slides/08-ridge-regression.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Held out sets",
- "text": "Held out sets\nOne option is to have a separate “held out” or “validation set”.\n👍 Estimates the test error\n👍 Fast computationally\n🤮 Estimate is random\n🤮 Estimate has high variance (depends on 1 choice of split)\n🤮 Estimate has some bias because we only used some of the data"
+ "section": "08 Ridge regression",
+ "text": "08 Ridge regression\nStat 406\nDaniel J. McDonald\nLast modified – 27 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#aside",
- "href": "schedule/slides/05-estimating-test-mse.html#aside",
+ "objectID": "schedule/slides/08-ridge-regression.html#recap",
+ "href": "schedule/slides/08-ridge-regression.html#recap",
"title": "UBC Stat406 2023W",
- "section": "Aside",
- "text": "Aside\nIn my experience, CS has particular definitions of “training”, “validation”, and “test” data.\nI think these are not quite the same as in Statistics.\n\nTest data - Hypothetical data you don’t get to see, ever. Infinite amounts drawn from the population.\n\nExpected test error or Risk is an expected value over this distribution. It’s not a sum over some data kept aside.\n\nSometimes I’ll give you “test data”. You pretend that this is a good representation of the expectation and use it to see how well you did on the training data.\nTraining data - This is data that you get to touch.\nValidation set - Often, we need to choose models. One way to do this is to split off some of your training data and pretend that it’s like a “Test Set”.\n\nWhen and how you split your training data can be very important."
+ "section": "Recap",
+ "text": "Recap\nSo far, we have emphasized model selection as\nDecide which predictors we would like to use in our linear model\nOr similarly:\nDecide which of a few linear models to use\nTo do this, we used a risk estimate, and chose the “model” with the lowest estimate\n\nMoving forward, we need to generalize this to\nDecide which of possibly infinite prediction functions \\(f\\in\\mathcal{F}\\) to use\nThankfully, this isn’t really any different. We still use those same risk estimates.\nRemember: We were choosing models that balance bias and variance (and hence have low prediction risk).\n\\[\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#intuition-for-cv",
- "href": "schedule/slides/05-estimating-test-mse.html#intuition-for-cv",
+ "objectID": "schedule/slides/08-ridge-regression.html#regularization",
+ "href": "schedule/slides/08-ridge-regression.html#regularization",
"title": "UBC Stat406 2023W",
- "section": "Intuition for CV",
- "text": "Intuition for CV\nOne reason that \\(\\widehat{R}_n(\\widehat{f})\\) is bad is that we are using the same data to pick \\(\\widehat{f}\\) AND to estimate \\(R_n\\).\n“Validation set” fixes this, but holds out a particular, fixed block of data we pretend mimics the “test data”\n\nWhat if we set aside one observation, say the first one \\((y_1, x_1)\\).\nWe estimate \\(\\widehat{f}^{(1)}\\) without using the first observation.\nThen we test our prediction:\n\\[\\widetilde{R}_1(\\widehat{f}^{(1)}) = (y_1 -\\widehat{f}^{(1)}(x_1))^2.\\]\n(why the notation \\(\\widetilde{R}_1\\)? Because we’re estimating the risk with 1 observation. )"
+ "section": "Regularization",
+ "text": "Regularization\n\nAnother way to control bias and variance is through regularization or shrinkage.\nRather than selecting a few predictors that seem reasonable, maybe trying a few combinations, use them all.\nI mean ALL.\nBut, make your estimates of \\(\\beta\\) “smaller”"
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#keep-going",
- "href": "schedule/slides/05-estimating-test-mse.html#keep-going",
+ "objectID": "schedule/slides/08-ridge-regression.html#brief-aside-on-optimization",
+ "href": "schedule/slides/08-ridge-regression.html#brief-aside-on-optimization",
"title": "UBC Stat406 2023W",
- "section": "Keep going",
- "text": "Keep going\nBut that was only one data point \\((y_1, x_1)\\). Why stop there?\nDo the same with \\((y_2, x_2)\\)! Get an estimate \\(\\widehat{f}^{(2)}\\) without using it, then\n\\[\\widetilde{R}_1(\\widehat{f}^{(2)}) = (y_2 -\\widehat{f}^{(2)}(x_2))^2.\\]\nWe can keep doing this until we try it for every data point.\nAnd then average them! (Averages are good)\n\\[\\mbox{LOO-CV} = \\frac{1}{n}\\sum_{i=1}^n \\widetilde{R}_1(\\widehat{f}^{(i)}) = \\frac{1}{n}\\sum_{i=1}^n\n(y_i - \\widehat{f}^{(i)}(x_i))^2\\]\n\nThis is leave-one-out cross validation"
+ "section": "Brief aside on optimization",
+ "text": "Brief aside on optimization\n\nAn optimization problem has 2 components:\n\nThe “Objective function”: e.g. \\(\\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2\\).\nThe “constraint”: e.g. “fewer than 5 non-zero entries in \\(\\beta\\)”.\n\nA constrained minimization problem is written\n\n\\[\\min_\\beta f(\\beta)\\;\\; \\mbox{ subject to }\\;\\; C(\\beta)\\]\n\n\\(f(\\beta)\\) is the objective function\n\\(C(\\beta)\\) is the constraint"
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#problems-with-loo-cv",
- "href": "schedule/slides/05-estimating-test-mse.html#problems-with-loo-cv",
+ "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression-constrained-version",
+ "href": "schedule/slides/08-ridge-regression.html#ridge-regression-constrained-version",
"title": "UBC Stat406 2023W",
- "section": "Problems with LOO-CV",
- "text": "Problems with LOO-CV\n🤮 Each held out set is small \\((n=1)\\). Therefore, the variance of the Squared Error of each prediction is high.\n🤮 The training sets overlap. This is bad.\n\nUsually, averaging reduces variance: \\(\\Var{\\overline{X}} = \\frac{1}{n^2}\\sum_{i=1}^n \\Var{X_i} = \\frac{1}{n}\\Var{X_1}.\\)\nBut only if the variables are independent. If not, then \\(\\Var{\\overline{X}} = \\frac{1}{n^2}\\Var{ \\sum_{i=1}^n X_i} = \\frac{1}{n}\\Var{X_1} + \\frac{1}{n^2}\\sum_{i\\neq j} \\Cov{X_i}{X_j}.\\)\nSince the training sets overlap a lot, that covariance can be pretty big.\n\n🤮 We have to estimate this model \\(n\\) times.\n🎉 Bias is low because we used almost all the data to fit the model: \\(E[\\mbox{LOO-CV}] = R_{n-1}\\)"
+ "section": "Ridge regression (constrained version)",
+ "text": "Ridge regression (constrained version)\nOne way to do this for regression is to solve (say): \\[\n\\minimize_\\beta \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2\n\\quad \\st \\sum_j \\beta^2_j < s\n\\] for some \\(s>0\\).\n\nThis is called “ridge regression”.\nCall the minimizer of this problem \\(\\brt\\)\n\n\nCompare this to ordinary least squares:\n\\[\n\\minimize_\\beta \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2\n\\quad \\st \\beta \\in \\R^p\n\\]"
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#k-fold-cv",
- "href": "schedule/slides/05-estimating-test-mse.html#k-fold-cv",
+ "objectID": "schedule/slides/08-ridge-regression.html#geometry-of-ridge-regression-contours",
+ "href": "schedule/slides/08-ridge-regression.html#geometry-of-ridge-regression-contours",
"title": "UBC Stat406 2023W",
- "section": "K-fold CV",
- "text": "K-fold CV\n\n\nTo alleviate some of these problems, people usually use \\(K\\)-fold cross validation.\nThe idea of \\(K\\)-fold is\n\nDivide the data into \\(K\\) groups.\nLeave a group out and estimate with the rest.\nTest on the held-out group. Calculate an average risk over these \\(\\sim n/K\\) data.\nRepeat for all \\(K\\) groups.\nAverage the average risks.\n\n\n\n🎉 Less overlap, smaller covariance.\n🎉 Larger hold-out sets, smaller variance.\n🎉 Less computations (only need to estimate \\(K\\) times)\n🤮 LOO-CV is (nearly) unbiased for \\(R_n\\)\n🤮 K-fold CV is unbiased for \\(R_{n(1-1/K)}\\)\nThe risk depends on how much data you use to estimate the model. \\(R_n\\) depends on \\(n\\)."
+ "section": "Geometry of ridge regression (contours)",
+ "text": "Geometry of ridge regression (contours)\n\n\nCode\nlibrary(mvtnorm)\nnorm_ball <- function(q = 1, len = 1000) {\n tg <- seq(0, 2 * pi, length = len)\n out <- tibble(x = cos(tg), b = (1 - abs(x)^q)^(1 / q), bm = -b) |>\n pivot_longer(-x, values_to = \"y\")\n out$lab <- paste0('\"||\" * beta * \"||\"', \"[\", signif(q, 2), \"]\")\n return(out)\n}\n\nellipse_data <- function(\n n = 75, xlim = c(-2, 3), ylim = c(-2, 3),\n mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {\n expand_grid(\n x = seq(xlim[1], xlim[2], length.out = n),\n y = seq(ylim[1], ylim[2], length.out = n)) |>\n rowwise() |>\n mutate(z = dmvnorm(c(x, y), mean, Sigma))\n}\n\nlballmax <- function(ed, q = 1, tol = 1e-6, niter = 20) {\n ed <- filter(ed, x > 0, y > 0)\n feasible <- (ed$x^q + ed$y^q)^(1 / q) <= 1\n best <- ed[feasible, ]\n best[which.max(best$z), ]\n}\n\n\nnb <- norm_ball(2)\ned <- ellipse_data()\nbols <- data.frame(x = 1, y = 1)\nbhat <- lballmax(ed, 2)\nggplot(nb, aes(x, y)) +\n xlim(-2, 2) +\n ylim(-2, 2) +\n geom_path(colour = red) +\n geom_contour(mapping = aes(z = z), colour = blue, data = ed, bins = 7) +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0) +\n geom_point(data = bols) +\n coord_equal() +\n geom_label(\n data = bols,\n mapping = aes(label = bquote(\"hat(beta)[ols]\")),\n parse = TRUE, \n nudge_x = .3, nudge_y = .3\n ) +\n geom_point(data = bhat) +\n xlab(bquote(beta[1])) +\n ylab(bquote(beta[2])) +\n theme_bw(base_size = 24) +\n geom_label(\n data = bhat,\n mapping = aes(label = bquote(\"hat(beta)[s]^R\")),\n parse = TRUE,\n nudge_x = -.4, nudge_y = -.4\n )"
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#a-picture",
- "href": "schedule/slides/05-estimating-test-mse.html#a-picture",
+ "objectID": "schedule/slides/08-ridge-regression.html#brief-aside-on-norms",
+ "href": "schedule/slides/08-ridge-regression.html#brief-aside-on-norms",
"title": "UBC Stat406 2023W",
- "section": "A picture",
- "text": "A picture\n\n\nCode\npar(mar = c(0, 0, 0, 0))\nplot(NA, NA, ylim = c(0, 5), xlim = c(0, 10), bty = \"n\", yaxt = \"n\", xaxt = \"n\")\nrect(0, .1 + c(0, 2, 3, 4), 10, .9 + c(0, 2, 3, 4), col = blue, density = 10)\nrect(c(0, 1, 2, 9), rev(.1 + c(0, 2, 3, 4)), c(1, 2, 3, 10), \n rev(.9 + c(0, 2, 3, 4)), col = red, density = 10)\npoints(c(5, 5, 5), 1 + 1:3 / 4, pch = 19)\ntext(.5 + c(0, 1, 2, 9), .5 + c(4, 3, 2, 0), c(\"1\", \"2\", \"3\", \"K\"), cex = 3, \n col = red)\ntext(6, 4.5, \"Training data\", cex = 3, col = blue)\ntext(2, 1.5, \"Validation data\", cex = 3, col = red)"
+ "section": "Brief aside on norms",
+ "text": "Brief aside on norms\nRecall, for a vector \\(z \\in \\R^p\\)\n\\[\\snorm{z}_2 = \\sqrt{z_1^2 + z_2^2 + \\cdots + z^2_p} = \\sqrt{\\sum_{j=1}^p z_j^2}\\]\nSo,\n\\[\\snorm{z}^2_2 = z_1^2 + z_2^2 + \\cdots + z^2_p = \\sum_{j=1}^p z_j^2.\\]"
},
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#code",
- "href": "schedule/slides/05-estimating-test-mse.html#code",
+ "objectID": "schedule/slides/08-ridge-regression.html#other-norms-we-should-remember",
+ "href": "schedule/slides/08-ridge-regression.html#other-norms-we-should-remember",
"title": "UBC Stat406 2023W",
- "section": "Code",
- "text": "Code\n\n#' @param data The full data set\n#' @param estimator Function. Has 1 argument (some data) and fits a model. \n#' @param predictor Function. Has 2 args (the fitted model, the_newdata) and produces predictions\n#' @param error_fun Function. Has one arg: the test data, with fits added.\n#' @param kfolds Integer. The number of folds.\nkfold_cv <- function(data, estimator, predictor, error_fun, kfolds = 5) {\n n <- nrow(data)\n fold_labels <- sample(rep(1:kfolds, length.out = n))\n errors <- double(kfolds)\n for (fold in seq_len(kfolds)) {\n test_rows <- fold_labels == fold\n train <- data[!test_rows, ]\n test <- data[test_rows, ]\n current_model <- estimator(train)\n test$.preds <- predictor(current_model, test)\n errors[fold] <- error_fun(test)\n }\n mean(errors)\n}\n\n\n\nsomedata <- data.frame(z = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))\nest <- function(dataset) lm(z ~ ., data = dataset)\npred <- function(mod, dataset) predict(mod, newdata = dataset)\nerror_fun <- function(testdata) mutate(testdata, errs = (z - .preds)^2) |> pull(errs) |> mean()\nkfold_cv(somedata, est, pred, error_fun, 5)\n\n[1] 0.9532271"
+ "section": "Other norms we should remember:",
+ "text": "Other norms we should remember:\n\n\\(\\ell_q\\)-norm\n\n\\(\\left(\\sum_{j=1}^p |z_j|^q\\right)^{1/q}\\)\n\n\\(\\ell_1\\)-norm (special case)\n\n\\(\\sum_{j=1}^p |z_j|\\)\n\n\\(\\ell_0\\)-norm\n\n\\(\\sum_{j=1}^p I(z_j \\neq 0 ) = \\lvert \\{j : z_j \\neq 0 \\}\\rvert\\)\n\n\\(\\ell_\\infty\\)-norm\n\n\\(\\max_{1\\leq j \\leq p} |z_j|\\)\n\n\n\n\nRecall what a norm is: https://en.wikipedia.org/wiki/Norm_(mathematics)"
},
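A small sketch evaluating these norms for a concrete vector (the values are arbitrary):

```r
z <- c(3, 0, -4, 0.5)

lq_norm <- function(z, q) sum(abs(z)^q)^(1 / q)

lq_norm(z, 2)   # l2 norm: square root of the sum of squares
lq_norm(z, 1)   # l1 norm: sum of absolute values
sum(z != 0)     # l0 "norm": number of nonzero entries
max(abs(z))     # l-infinity norm: largest absolute entry
```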
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#trick",
- "href": "schedule/slides/05-estimating-test-mse.html#trick",
+ "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression",
+ "href": "schedule/slides/08-ridge-regression.html#ridge-regression",
"title": "UBC Stat406 2023W",
- "section": "Trick",
- "text": "Trick\nFor a certain “nice” models, one can show\n(after pages of tedious algebra which I wouldn’t wish on my worst enemy, but might, in a fit of rage assign as homework to belligerent students)\n\\[\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(y_i -\\widehat{y}_i)^2}{(1-h_{ii})^2} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-h_{ii})^2}.\\]\n\nThis trick means that you only have to fit the model once rather than \\(n\\) times!\nYou still have to calculate this for each model!\n\n\ncv_nice <- function(mdl) mean( (residuals(mdl) / (1 - hatvalues(mdl)))^2 )"
+ "section": "Ridge regression",
+ "text": "Ridge regression\nAn equivalent way to write\n\\[\\brt = \\argmin_{ || \\beta ||_2^2 \\leq s} \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2\\]\nis in the Lagrangian form\n\\[\\brl = \\argmin_{ \\beta} \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2 + \\lambda || \\beta ||_2^2.\\]\nFor every \\(\\lambda\\) there is a unique \\(s\\) (and vice versa) that makes\n\\[\\brt = \\brl\\]"
},
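A sketch checking the equivalence numerically: minimize the Lagrangian objective with optim() and compare to the closed form that matches this 1/n scaling of the squared-error term. The data and lambda are made up, and other software scales lambda differently:

```r
set.seed(1)
n <- 200; p <- 3; lambda <- 0.5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n))

# Lagrangian objective: (1/n) * RSS + lambda * ||beta||_2^2
obj <- function(b) mean((y - X %*% b)^2) + lambda * sum(b^2)
b_optim <- optim(rep(0, p), obj, method = "BFGS")$par

# closed form for this scaling: (X'X + n * lambda * I)^{-1} X'y
b_closed <- solve(crossprod(X) + n * lambda * diag(p), crossprod(X, y))

cbind(optim = b_optim, closed_form = drop(b_closed))
```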
{
- "objectID": "schedule/slides/05-estimating-test-mse.html#trick-1",
- "href": "schedule/slides/05-estimating-test-mse.html#trick-1",
+ "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression-1",
+ "href": "schedule/slides/08-ridge-regression.html#ridge-regression-1",
"title": "UBC Stat406 2023W",
- "section": "Trick",
- "text": "Trick\n\ncv_nice <- function(mdl) mean( (residuals(mdl) / (1 - hatvalues(mdl)))^2 )\n\n“Nice” requires:\n\n\\(\\widehat{y}_i = h_i(\\mathbf{X})^\\top \\mathbf{y}\\) for some vector \\(h_i\\)\n\\(e^{(i)} = \\frac{\\widehat{e}_i}{(1-h_{ii})}\\)"
+ "section": "Ridge regression",
+ "text": "Ridge regression\n\\(\\brt = \\argmin_{ || \\beta ||_2^2 \\leq s} \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2\\)\n\\(\\brl = \\argmin_{ \\beta} \\frac{1}{n}\\sum_i (y_i-x^\\top_i \\beta)^2 + \\lambda || \\beta ||_2^2\\)\nObserve:\n\n\\(\\lambda = 0\\) (or \\(s = \\infty\\)) makes \\(\\brl = \\bls\\)\nAny \\(\\lambda > 0\\) (or \\(s <\\infty\\)) penalizes larger values of \\(\\beta\\), effectively shrinking them.\n\n\\(\\lambda\\) and \\(s\\) are known as tuning parameters"
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#meta-lecture",
- "href": "schedule/slides/07-greedy-selection.html#meta-lecture",
+ "objectID": "schedule/slides/08-ridge-regression.html#visualizing-ridge-regression-2-coefficients",
+ "href": "schedule/slides/08-ridge-regression.html#visualizing-ridge-regression-2-coefficients",
"title": "UBC Stat406 2023W",
- "section": "07 Greedy selection",
- "text": "07 Greedy selection\nStat 406\nDaniel J. McDonald\nLast modified – 18 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\]"
+ "section": "Visualizing ridge regression (2 coefficients)",
+ "text": "Visualizing ridge regression (2 coefficients)\n\n\nCode\nb <- c(1, 1)\nn <- 1000\nlams <- c(1, 5, 10)\nols_loss <- function(b1, b2) colMeans((y - X %*% rbind(b1, b2))^2) / 2\npen <- function(b1, b2, lambda = 1) lambda * (b1^2 + b2^2) / 2\ngr <- expand_grid(\n b1 = seq(b[1] - 0.5, b[1] + 0.5, length.out = 100),\n b2 = seq(b[2] - 0.5, b[2] + 0.5, length.out = 100)\n)\n\nX <- mvtnorm::rmvnorm(n, c(0, 0), sigma = matrix(c(1, .3, .3, .5), nrow = 2))\ny <- drop(X %*% b + rnorm(n))\n\nbols <- coef(lm(y ~ X - 1))\nbridge <- coef(MASS::lm.ridge(y ~ X - 1, lambda = lams * sqrt(n)))\n\npenalties <- lams |>\n set_names(~ paste(\"lam =\", .)) |>\n map(~ pen(gr$b1, gr$b2, .x)) |>\n as_tibble()\ngr <- gr |>\n mutate(loss = ols_loss(b1, b2)) |>\n bind_cols(penalties)\n\ng1 <- ggplot(gr, aes(b1, b2)) +\n geom_raster(aes(fill = loss)) +\n scale_fill_viridis_c(direction = -1) +\n geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +\n theme(legend.position = \"bottom\") +\n guides(fill = guide_colourbar(barwidth = 20, barheight = 0.5))\n\ng2 <- gr |>\n pivot_longer(starts_with(\"lam\")) |>\n mutate(name = factor(name, levels = paste(\"lam =\", lams))) |>\n ggplot(aes(b1, b2)) +\n geom_raster(aes(fill = value)) +\n scale_fill_viridis_c(direction = -1, name = \"penalty\") +\n facet_wrap(~name, ncol = 1) +\n geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +\n theme(legend.position = \"bottom\") +\n guides(fill = guide_colourbar(barwidth = 10, barheight = 0.5))\n\ng3 <- gr |> \n mutate(across(starts_with(\"lam\"), ~ loss + .x)) |>\n pivot_longer(starts_with(\"lam\")) |>\n mutate(name = factor(name, levels = paste(\"lam =\", lams))) |>\n ggplot(aes(b1, b2)) +\n geom_raster(aes(fill = value)) +\n scale_fill_viridis_c(direction = -1, name = \"loss + pen\") +\n facet_wrap(~name, ncol = 1) +\n geom_point(data = data.frame(b1 = b[1], b2 = b[2])) +\n theme(legend.position = \"bottom\") +\n guides(fill = guide_colourbar(barwidth = 10, barheight = 0.5))\n\ncowplot::plot_grid(g1, g2, g3, rel_widths = c(2, 1, 1), nrow = 1)"
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#recap",
- "href": "schedule/slides/07-greedy-selection.html#recap",
+ "objectID": "schedule/slides/08-ridge-regression.html#the-effect-on-the-estimates",
+ "href": "schedule/slides/08-ridge-regression.html#the-effect-on-the-estimates",
"title": "UBC Stat406 2023W",
- "section": "Recap",
- "text": "Recap\nModel Selection means select a family of distributions for your data.\nIdeally, we’d do this by comparing the \\(R_n\\) for one family with that for another.\nWe’d use whichever has smaller \\(R_n\\).\nBut \\(R_n\\) depends on the truth, so we estimate it with \\(\\widehat{R}\\).\nThen we use whichever has smaller \\(\\widehat{R}\\)."
+ "section": "The effect on the estimates",
+ "text": "The effect on the estimates\n\n\nCode\ngr |> \n mutate(z = ols_loss(b1, b2) + max(lams) * pen(b1, b2)) |>\n ggplot(aes(b1, b2)) +\n geom_raster(aes(fill = z)) +\n scale_fill_viridis_c(direction = -1) +\n geom_point(data = tibble(\n b1 = c(bols[1], bridge[,1]),\n b2 = c(bols[2], bridge[,2]),\n estimate = factor(c(\"ols\", paste0(\"ridge = \", lams)), \n levels = c(\"ols\", paste0(\"ridge = \", lams)))\n ),\n aes(shape = estimate), size = 3) +\n geom_point(data = data.frame(b1 = b[1], b2 = b[2]), colour = orange, size = 4)"
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#example",
- "href": "schedule/slides/07-greedy-selection.html#example",
+ "objectID": "schedule/slides/08-ridge-regression.html#example-data",
+ "href": "schedule/slides/08-ridge-regression.html#example-data",
"title": "UBC Stat406 2023W",
- "section": "Example",
- "text": "Example\nThe truth:\n\ndat <- tibble(\n x1 = rnorm(100), \n x2 = rnorm(100),\n y = 3 + x1 - 5 * x2 + sin(x1 * x2 / (2 * pi)) + rnorm(100, sd = 5)\n)\n\nModel 1: y ~ x1 + x2\nModel 2: y ~ x1 + x2 + x1*x2\nModel 3: y ~ x2 + sin(x1 * x2)\n\n(What are the families for each of these?)"
+ "section": "Example data",
+ "text": "Example data\nprostate data from [ESL]\n\ndata(prostate, package = \"ElemStatLearn\")\nprostate |> as_tibble()\n\n# A tibble: 97 × 10\n lcavol lweight age lbph svi lcp gleason pgg45 lpsa train\n <dbl> <dbl> <int> <dbl> <int> <dbl> <int> <int> <dbl> <lgl>\n 1 -0.580 2.77 50 -1.39 0 -1.39 6 0 -0.431 TRUE \n 2 -0.994 3.32 58 -1.39 0 -1.39 6 0 -0.163 TRUE \n 3 -0.511 2.69 74 -1.39 0 -1.39 7 20 -0.163 TRUE \n 4 -1.20 3.28 58 -1.39 0 -1.39 6 0 -0.163 TRUE \n 5 0.751 3.43 62 -1.39 0 -1.39 6 0 0.372 TRUE \n 6 -1.05 3.23 50 -1.39 0 -1.39 6 0 0.765 TRUE \n 7 0.737 3.47 64 0.615 0 -1.39 6 0 0.765 FALSE\n 8 0.693 3.54 58 1.54 0 -1.39 6 0 0.854 TRUE \n 9 -0.777 3.54 47 -1.39 0 -1.39 6 0 1.05 FALSE\n10 0.223 3.24 63 -1.39 0 -1.39 6 0 1.05 FALSE\n# ℹ 87 more rows\n\n\n\nUse lpsa as response."
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#fit-each-model-and-estimate-r_n",
- "href": "schedule/slides/07-greedy-selection.html#fit-each-model-and-estimate-r_n",
+ "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression-path",
+ "href": "schedule/slides/08-ridge-regression.html#ridge-regression-path",
"title": "UBC Stat406 2023W",
- "section": "Fit each model and estimate \\(R_n\\)",
- "text": "Fit each model and estimate \\(R_n\\)\n\nforms <- list(\"y ~ x1 + x2\", \"y ~ x1 * x2\", \"y ~ x2 + sin(x1*x2)\") |> \n map(as.formula)\nfits <- map(forms, ~ lm(.x, data = dat))\nmap(fits, ~ tibble(\n R2 = summary(.x)$r.sq,\n training_error = mean(residuals(.x)^2),\n loocv = mean( (residuals(.x) / (1 - hatvalues(.x)))^2 ),\n AIC = AIC(.x),\n BIC = BIC(.x)\n)) |> list_rbind()\n\n# A tibble: 3 × 5\n R2 training_error loocv AIC BIC\n <dbl> <dbl> <dbl> <dbl> <dbl>\n1 0.589 21.3 22.9 598. 608.\n2 0.595 21.0 23.4 598. 611.\n3 0.586 21.4 23.0 598. 609."
+ "section": "Ridge regression path",
+ "text": "Ridge regression path\n\nY <- prostate$lpsa\nX <- model.matrix(~ ., data = prostate |> dplyr::select(-train, -lpsa))\nlibrary(glmnet)\nridge <- glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)\n\n\n\n\n\nplot(ridge, xvar = \"lambda\", lwd = 3)\n\n\n\n\n\n\n\n\n\n\nModel selection here:\n\nmeans choose some \\(\\lambda\\)\nA value of \\(\\lambda\\) is a vertical line.\nThis graphic is a “path” or “coefficient trace”\nCoefficients for varying \\(\\lambda\\)"
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#model-selection-vs.-variable-selection",
- "href": "schedule/slides/07-greedy-selection.html#model-selection-vs.-variable-selection",
+ "objectID": "schedule/slides/08-ridge-regression.html#solving-the-minimization",
+ "href": "schedule/slides/08-ridge-regression.html#solving-the-minimization",
"title": "UBC Stat406 2023W",
- "section": "Model Selection vs. Variable Selection",
- "text": "Model Selection vs. Variable Selection\nModel selection is very comprehensive\nYou choose a full statistical model (probability distribution) that will be hypothesized to have generated the data.\nVariable selection is a subset of this. It means\n\nchoosing which predictors to include in a predictive model\n\nEliminating a predictor, means removing it from the model.\nSome procedures automatically search predictors, and eliminate some.\nWe call this variable selection. But the procedure is implicitly selecting a model as well.\n\nMaking this all the more complicated, with lots of effort, we can map procedures/algorithms to larger classes of probability models, and analyze them."
+ "section": "Solving the minimization",
+ "text": "Solving the minimization\n\nOne nice thing about ridge regression is that it has a closed-form solution (like OLS)\n\n\\[\\brl = (\\X^\\top\\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y\\]\n\nThis is easy to calculate in R for any \\(\\lambda\\).\nHowever, computations and interpretation are simplified if we examine the Singular Value Decomposition of \\(\\X = \\mathbf{UDV}^\\top\\).\nRecall: any matrix has an SVD.\nHere \\(\\mathbf{D}\\) is diagonal and \\(\\mathbf{U}\\) and \\(\\mathbf{V}\\) are orthonormal: \\(\\mathbf{U}^\\top\\mathbf{U} = \\mathbf{I}\\)."
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#selecting-variables-predictors-with-linear-methods",
- "href": "schedule/slides/07-greedy-selection.html#selecting-variables-predictors-with-linear-methods",
+ "objectID": "schedule/slides/08-ridge-regression.html#solving-the-minization",
+ "href": "schedule/slides/08-ridge-regression.html#solving-the-minization",
"title": "UBC Stat406 2023W",
- "section": "Selecting variables / predictors with linear methods",
- "text": "Selecting variables / predictors with linear methods\n\n\nSuppose we have a pile of predictors.\nWe estimate models with different subsets of predictors and use CV / Cp / AIC / BIC to decide which is preferred.\nSometimes you might have a few plausible subsets. Easy enough to choose with our criterion.\nSometimes you might just have a bunch of predictors, then what do you do?\n\n\n\nAll subsets\n\nestimate model based on every possible subset of size \\(|\\mathcal{S}| \\leq \\min\\{n, p\\}\\), use one with lowest risk estimate\n\nForward selection\n\nstart with \\(\\mathcal{S}=\\varnothing\\), add predictors greedily\n\nBackward selection\n\nstart with \\(\\mathcal{S}=\\{1,\\ldots,p\\}\\), remove greedily\n\nHybrid\n\ncombine forward and backward smartly"
+ "section": "Solving the minization",
+ "text": "Solving the minization\n\\[\\brl = (\\X^\\top\\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y\\]\n\nNote that \\(\\mathbf{X}^\\top\\mathbf{X} = \\mathbf{VDU}^\\top\\mathbf{UDV}^\\top = \\mathbf{V}\\mathbf{D}^2\\mathbf{V}^\\top\\).\nThen,\n\n\\[\\brl = (\\X^\\top \\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y = (\\mathbf{VD}^2\\mathbf{V}^\\top + \\lambda \\mathbf{I})^{-1}\\mathbf{VDU}^\\top \\y\n= \\mathbf{V}(\\mathbf{D}^2+\\lambda \\mathbf{I})^{-1} \\mathbf{DU}^\\top \\y.\\]\n\nFor computations, now we only need to invert \\(\\mathbf{D}\\)."
},
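A sketch verifying the SVD identity on simulated data: the direct formula and the form V (D^2 + lambda I)^{-1} D U'y agree, and the latter only requires inverting a diagonal matrix:

```r
set.seed(2)
n <- 100; p <- 4; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p) + rnorm(n))

s <- svd(X)  # X = U D V'

# direct formula: (X'X + lambda I)^{-1} X'y
b_direct <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

# SVD formula: only the diagonal entries d_j / (d_j^2 + lambda) are needed
b_svd <- s$v %*% ((s$d / (s$d^2 + lambda)) * crossprod(s$u, y))

cbind(direct = drop(b_direct), via_svd = drop(b_svd))
```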
{
- "objectID": "schedule/slides/07-greedy-selection.html#costs-and-benefits",
- "href": "schedule/slides/07-greedy-selection.html#costs-and-benefits",
+ "objectID": "schedule/slides/08-ridge-regression.html#comparing-with-ols",
+ "href": "schedule/slides/08-ridge-regression.html#comparing-with-ols",
"title": "UBC Stat406 2023W",
- "section": "Costs and benefits",
- "text": "Costs and benefits\n\nAll subsets\n\n👍 estimates each subset\n💣 takes \\(2^p\\) model fits when \\(p<n\\). If \\(p=50\\), this is about \\(10^{15}\\) models.\n\nForward selection\n\n👍 computationally feasible\n💣 ignores some models, correlated predictors means bad performance\n\nBackward selection\n\n👍 computationally feasible\n💣 ignores some models, correlated predictors means bad performance\n💣 doesn’t work if \\(p>n\\)\n\nHybrid\n\n👍 visits more models than forward/backward\n💣 slower"
+ "section": "Comparing with OLS",
+ "text": "Comparing with OLS\n\n\\(\\mathbf{D}\\) is a diagonal matrix\n\n\\[\\bls = (\\X^\\top\\X)^{-1}\\X^\\top \\y = (\\mathbf{VD}^2\\mathbf{V}^\\top)^{-1}\\mathbf{VDU}^\\top \\y = \\mathbf{V}\\color{red}{\\mathbf{D}^{-2}\\mathbf{D}}\\mathbf{U}^\\top \\y = \\mathbf{V}\\color{red}{\\mathbf{D}^{-1}}\\mathbf{U}^\\top \\y\\]\n\\[\\brl = (\\X^\\top \\X + \\lambda \\mathbf{I})^{-1}\\X^\\top \\y = \\mathbf{V}\\color{red}{(\\mathbf{D}^2+\\lambda \\mathbf{I})^{-1}} \\mathbf{DU}^\\top \\y.\\]\n\nNotice that \\(\\bls\\) depends on \\(d_j/d_j^2\\) while \\(\\brl\\) depends on \\(d_j/(d_j^2 + \\lambda)\\).\nRidge regression makes the coefficients smaller relative to OLS.\nBut if \\(\\X\\) has small singular values, ridge regression compensates with \\(\\lambda\\) in the denominator."
},
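A tiny sketch of the two sets of factors, for hypothetical singular values:

```r
d <- c(10, 2, 0.1)   # hypothetical singular values of X
lambda <- 1

rbind(
  ols   = 1 / d,              # blows up when d_j is small
  ridge = d / (d^2 + lambda)  # stays bounded, at most 1 / (2 * sqrt(lambda))
)
```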
{
- "objectID": "schedule/slides/07-greedy-selection.html#synthetic-example",
- "href": "schedule/slides/07-greedy-selection.html#synthetic-example",
+ "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression-and-multicollinearity",
+ "href": "schedule/slides/08-ridge-regression.html#ridge-regression-and-multicollinearity",
"title": "UBC Stat406 2023W",
- "section": "Synthetic example",
- "text": "Synthetic example\n\nset.seed(123)\nn <- 406\ndf <- tibble( # like data.frame, but columns can be functions of preceding\n x1 = rnorm(n),\n x2 = rnorm(n, mean = 2, sd = 1),\n x3 = rexp(n, rate = 1),\n x4 = x2 + rnorm(n, sd = .1), # correlated with x2\n x5 = x1 + rnorm(n, sd = .1), # correlated with x1\n x6 = x1 - x2 + rnorm(n, sd = .1), # correlated with x2 and x1 (and others)\n x7 = x1 + x3 + rnorm(n, sd = .1), # correlated with x1 and x3 (and others)\n y = x1 * 3 + x2 / 3 + rnorm(n, sd = 2.2) # function of x1 and x2 only\n)\n\n\n\\(\\mathbf{x}_1\\) and \\(\\mathbf{x}_2\\) are the true predictors\nBut the rest are correlated with them"
+ "section": "Ridge regression and multicollinearity",
+ "text": "Ridge regression and multicollinearity\nMulticollinearity: a linear combination of predictor variables is nearly equal to another predictor variable.\nSome comments:\n\nA better phrase: \\(\\X\\) is ill-conditioned\nAKA “(numerically) rank-deficient”.\n\\(\\X = \\mathbf{U D V}^\\top\\) ill-conditioned \\(\\Longleftrightarrow\\) some elements of \\(\\mathbf{D} \\approx 0\\)\n\\(\\bls= \\mathbf{V D}^{-1} \\mathbf{U}^\\top \\y\\), so small entries of \\(\\mathbf{D}\\) \\(\\Longleftrightarrow\\) huge elements of \\(\\mathbf{D}^{-1}\\)\nMeans huge variance: \\(\\Var{\\bls} = \\sigma^2(\\X^\\top \\X)^{-1} = \\sigma^2 \\mathbf{V D}^{-2} \\mathbf{V}^\\top\\)"
},
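A sketch making this concrete on simulated data: a nearly collinear design has one tiny singular value, so (X'X)^{-1} has huge entries, while adding lambda I tames them:

```r
set.seed(3)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 1e-3)   # nearly an exact copy of x1
X <- cbind(x1, x2)

svd(X)$d                         # one singular value is close to zero

max(abs(solve(crossprod(X))))               # huge: Var(ols) = sigma^2 (X'X)^{-1}
max(abs(solve(crossprod(X) + 1 * diag(2)))) # ridge with lambda = 1: much smaller
```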
{
- "objectID": "schedule/slides/07-greedy-selection.html#full-model",
- "href": "schedule/slides/07-greedy-selection.html#full-model",
+ "objectID": "schedule/slides/08-ridge-regression.html#ridge-regression-and-ill-posed-x",
+ "href": "schedule/slides/08-ridge-regression.html#ridge-regression-and-ill-posed-x",
"title": "UBC Stat406 2023W",
- "section": "Full model",
- "text": "Full model\n\nfull <- lm(y ~ ., data = df)\nsummary(full)\n\n\nCall:\nlm(formula = y ~ ., data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.7739 -1.4283 -0.0929 1.4257 7.5869 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.03383 0.27700 0.122 0.90287 \nx1 6.70481 2.06743 3.243 0.00128 **\nx2 -0.43945 1.71650 -0.256 0.79807 \nx3 1.37293 1.11524 1.231 0.21903 \nx4 -1.19911 1.17850 -1.017 0.30954 \nx5 -0.53918 1.07089 -0.503 0.61490 \nx6 -1.88547 1.21652 -1.550 0.12196 \nx7 -1.25245 1.10743 -1.131 0.25876 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.231 on 398 degrees of freedom\nMultiple R-squared: 0.6411, Adjusted R-squared: 0.6347 \nF-statistic: 101.5 on 7 and 398 DF, p-value: < 2.2e-16"
+ "section": "Ridge regression and ill-posed \\(\\X\\)",
+ "text": "Ridge regression and ill-posed \\(\\X\\)\nRidge Regression fixes this problem by preventing the division by a near-zero number\n\nConclusion\n\n\\((\\X^{\\top}\\X)^{-1}\\) can be really unstable, while \\((\\X^{\\top}\\X + \\lambda \\mathbf{I})^{-1}\\) is not.\n\nAside\n\nEngineering approach to solving linear systems is to always do this with small \\(\\lambda\\). The thinking is about the numerics rather than the statistics.\n\n\nWhich \\(\\lambda\\) to use?\n\nComputational\n\nUse CV and pick the \\(\\lambda\\) that makes this smallest.\n\nIntuition (bias)\n\nAs \\(\\lambda\\rightarrow\\infty\\), bias ⬆\n\nIntuition (variance)\n\nAs \\(\\lambda\\rightarrow\\infty\\), variance ⬇\n\n\nYou should think about why."
},
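A sketch of the computational answer, assuming the X, Y, and glmnet setup from the ridge-path slide is still in the workspace:

```r
# 10-fold CV over the same ridge path (alpha = 0)
cv_ridge <- cv.glmnet(x = X, y = Y, alpha = 0, lambda.min.ratio = .00001)

plot(cv_ridge)                    # CV estimate of risk across the lambda path
cv_ridge$lambda.min               # lambda minimizing the CV estimate
coef(cv_ridge, s = "lambda.min")  # coefficients at that lambda
```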
{
- "objectID": "schedule/slides/07-greedy-selection.html#true-model",
- "href": "schedule/slides/07-greedy-selection.html#true-model",
+ "objectID": "schedule/slides/08-ridge-regression.html#can-we-get-the-best-of-both-worlds",
+ "href": "schedule/slides/08-ridge-regression.html#can-we-get-the-best-of-both-worlds",
"title": "UBC Stat406 2023W",
- "section": "True model",
- "text": "True model\n\ntruth <- lm(y ~ x1 + x2, data = df)\nsummary(truth)\n\n\nCall:\nlm(formula = y ~ x1 + x2, data = df)\n\nResiduals:\n Min 1Q Median 3Q Max \n-6.4519 -1.3873 -0.1941 1.3498 7.5533 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 0.1676 0.2492 0.673 0.5015 \nx1 3.0316 0.1146 26.447 <2e-16 ***\nx2 0.2447 0.1109 2.207 0.0279 * \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.233 on 403 degrees of freedom\nMultiple R-squared: 0.6357, Adjusted R-squared: 0.6339 \nF-statistic: 351.6 on 2 and 403 DF, p-value: < 2.2e-16"
+ "section": "Can we get the best of both worlds?",
+ "text": "Can we get the best of both worlds?\nTo recap:\n\nDeciding which predictors to include, adding quadratic terms, or interactions is model selection (more precisely variable selection within a linear model).\nRidge regression provides regularization, which trades off bias and variance and also stabilizes multicollinearity.\nIf the LM is true,\n\nOLS is unbiased, but Variance depends on \\(\\mathbf{D}^{-2}\\). Can be big.\nRidge is biased (can you find the bias?). But Variance is smaller than OLS.\n\nRidge regression does not perform variable selection.\nBut picking \\(\\lambda=3.7\\) and thereby deciding to predict with \\(\\widehat{\\beta}^R_{3.7}\\) is model selection."
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#all-subsets",
- "href": "schedule/slides/07-greedy-selection.html#all-subsets",
+ "objectID": "schedule/slides/08-ridge-regression.html#can-we-get-the-best-of-both-worlds-1",
+ "href": "schedule/slides/08-ridge-regression.html#can-we-get-the-best-of-both-worlds-1",
"title": "UBC Stat406 2023W",
- "section": "All subsets",
- "text": "All subsets\n\nlibrary(leaps)\ntrythemall <- regsubsets(y ~ ., data = df)\nsummary(trythemall)\n\nSubset selection object\nCall: regsubsets.formula(y ~ ., data = df)\n7 Variables (and intercept)\n Forced in Forced out\nx1 FALSE FALSE\nx2 FALSE FALSE\nx3 FALSE FALSE\nx4 FALSE FALSE\nx5 FALSE FALSE\nx6 FALSE FALSE\nx7 FALSE FALSE\n1 subsets of each size up to 7\nSelection Algorithm: exhaustive\n x1 x2 x3 x4 x5 x6 x7 \n1 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \" \" \" \"\n2 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \"*\" \" \"\n3 ( 1 ) \"*\" \" \" \" \" \"*\" \" \" \"*\" \" \"\n4 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \" \"\n5 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \"*\"\n6 ( 1 ) \"*\" \" \" \"*\" \"*\" \"*\" \"*\" \"*\"\n7 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \"*\" \"*\""
+ "section": "Can we get the best of both worlds?",
+ "text": "Can we get the best of both worlds?\n\nRidge regression\n\n\\(\\minimize \\frac{1}{n}||\\y-\\X\\beta||_2^2 \\ \\st\\ ||\\beta||_2^2 \\leq s\\)\n\nBest (in-sample) linear regression model of size \\(s\\)\n\n\\(\\minimize \\frac{1}{n}||\\y-\\X\\beta||_2^2 \\ \\st\\ ||\\beta||_0 \\leq s\\)\n\n\n\\(||\\beta||_0\\) is the number of nonzero elements in \\(\\beta\\)\nFinding the best in-sample linear model (of size \\(s\\), among these predictors) is a nonconvex optimization problem (In fact, it is NP-hard)\nRidge regression is convex (easy to solve), but doesn’t do variable selection\nCan we somehow “interpolate” to get both?\nNote: selecting \\(\\lambda\\) is still model selection, but we’ve included all the variables."
},
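For concreteness, the ℓ0 "norm" in the second problem just counts nonzero coefficients; a two-line R illustration (the vector is made up):

beta <- c(3, 0, 0, -1.2, 0)
sum(beta != 0)   # ||beta||_0 = 2 nonzero elements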
{
- "objectID": "schedule/slides/07-greedy-selection.html#bic-and-cp",
- "href": "schedule/slides/07-greedy-selection.html#bic-and-cp",
+ "objectID": "schedule/slides/10-basis-expansions.html#meta-lecture",
+ "href": "schedule/slides/10-basis-expansions.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "BIC and Cp",
- "text": "BIC and Cp\n\n\ntibble(\n BIC = summary(trythemall)$bic, \n Cp = summary(trythemall)$cp,\n size = 1:7\n) |>\n pivot_longer(-size) |>\n ggplot(aes(size, value, colour = name)) + \n geom_point() + \n geom_line() + \n facet_wrap(~name, scales = \"free_y\") + \n ylab(\"\") +\n scale_colour_manual(\n values = c(blue, orange), \n guide = \"none\"\n )"
+ "section": "10 Basis expansions",
+ "text": "10 Basis expansions\nStat 406\nDaniel J. McDonald\nLast modified – 27 September 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#forward-stepwise",
- "href": "schedule/slides/07-greedy-selection.html#forward-stepwise",
+ "objectID": "schedule/slides/10-basis-expansions.html#what-about-nonlinear-things",
+ "href": "schedule/slides/10-basis-expansions.html#what-about-nonlinear-things",
"title": "UBC Stat406 2023W",
- "section": "Forward stepwise",
- "text": "Forward stepwise\n\nstepup <- regsubsets(y ~ ., data = df, method = \"forward\")\nsummary(stepup)\n\nSubset selection object\nCall: regsubsets.formula(y ~ ., data = df, method = \"forward\")\n7 Variables (and intercept)\n Forced in Forced out\nx1 FALSE FALSE\nx2 FALSE FALSE\nx3 FALSE FALSE\nx4 FALSE FALSE\nx5 FALSE FALSE\nx6 FALSE FALSE\nx7 FALSE FALSE\n1 subsets of each size up to 7\nSelection Algorithm: forward\n x1 x2 x3 x4 x5 x6 x7 \n1 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \" \" \" \"\n2 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \"*\" \" \"\n3 ( 1 ) \"*\" \" \" \" \" \"*\" \" \" \"*\" \" \"\n4 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \" \"\n5 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \"*\"\n6 ( 1 ) \"*\" \" \" \"*\" \"*\" \"*\" \"*\" \"*\"\n7 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \"*\" \"*\""
+ "section": "What about nonlinear things",
+ "text": "What about nonlinear things\n\\[\\Expect{Y \\given X=x} = \\sum_{j=1}^p x_j\\beta_j\\]\nNow we relax this assumption of linearity:\n\\[\\Expect{Y \\given X=x} = f(x)\\]\nHow do we estimate \\(f\\)?\n\nFor this lecture, we use \\(x \\in \\R\\) (1 dimensional)\nHigher dimensions are possible, but complexity grows exponentially.\nWe’ll see some special techniques for \\(x\\in\\R^p\\) later this Module."
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#bic-and-cp-1",
- "href": "schedule/slides/07-greedy-selection.html#bic-and-cp-1",
+ "objectID": "schedule/slides/10-basis-expansions.html#start-simple",
+ "href": "schedule/slides/10-basis-expansions.html#start-simple",
"title": "UBC Stat406 2023W",
- "section": "BIC and Cp",
- "text": "BIC and Cp\n\n\ntibble(\n BIC = summary(stepup)$bic,\n Cp = summary(stepup)$cp,\n size = 1:7\n) |>\n pivot_longer(-size) |>\n ggplot(aes(size, value, colour = name)) +\n geom_point() +\n geom_line() +\n facet_wrap(~name, scales = \"free_y\") +\n ylab(\"\") +\n scale_colour_manual(\n values = c(blue, orange),\n guide = \"none\"\n )"
+ "section": "Start simple",
+ "text": "Start simple\nFor any \\(f : \\R \\rightarrow [0,1]\\)\n\\[f(x) = f(x_0) + f'(x_0)(x-x_0) + \\frac{1}{2}f''(x_0)(x-x_0)^2 + \\frac{1}{3!}f'''(x_0)(x-x_0)^3 + R_3(x-x_0)\\]\nSo we can linearly regress \\(y_i = f(x_i)\\) on the polynomials.\nThe more terms we use, the smaller \\(R\\).\n\n\nCode\nset.seed(406406)\ndata(arcuate, package = \"Stat406\") \narcuate <- arcuate |> slice_sample(n = 220)\narcuate %>% \n ggplot(aes(position, fa)) + \n geom_point(color = blue) +\n geom_smooth(color = orange, formula = y ~ poly(x, 3), method = \"lm\", se = FALSE)"
},
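A quick numerical check of the Taylor claim (a sketch, not from the slides): approximate f(x) = sin(x) around x0 = 0 and watch the remainder shrink as terms are added.

x0 <- 0
x <- 1
taylor_sin <- function(k) {
  # derivatives of sin at 0 cycle through 0, 1, 0, -1
  sum(sapply(0:k, function(j) c(0, 1, 0, -1)[(j %% 4) + 1] * (x - x0)^j / factorial(j)))
}
sapply(c(1, 3, 5, 7), function(k) abs(sin(x) - taylor_sin(k)))  # the remainder shrinks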
{
- "objectID": "schedule/slides/07-greedy-selection.html#backward-selection",
- "href": "schedule/slides/07-greedy-selection.html#backward-selection",
+ "objectID": "schedule/slides/10-basis-expansions.html#same-thing-different-orders",
+ "href": "schedule/slides/10-basis-expansions.html#same-thing-different-orders",
"title": "UBC Stat406 2023W",
- "section": "Backward selection",
- "text": "Backward selection\n\nstepdown <- regsubsets(y ~ ., data = df, method = \"backward\")\nsummary(stepdown)\n\nSubset selection object\nCall: regsubsets.formula(y ~ ., data = df, method = \"backward\")\n7 Variables (and intercept)\n Forced in Forced out\nx1 FALSE FALSE\nx2 FALSE FALSE\nx3 FALSE FALSE\nx4 FALSE FALSE\nx5 FALSE FALSE\nx6 FALSE FALSE\nx7 FALSE FALSE\n1 subsets of each size up to 7\nSelection Algorithm: backward\n x1 x2 x3 x4 x5 x6 x7 \n1 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \" \" \" \"\n2 ( 1 ) \"*\" \" \" \" \" \" \" \" \" \"*\" \" \"\n3 ( 1 ) \"*\" \" \" \" \" \"*\" \" \" \"*\" \" \"\n4 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \" \"\n5 ( 1 ) \"*\" \" \" \"*\" \"*\" \" \" \"*\" \"*\"\n6 ( 1 ) \"*\" \" \" \"*\" \"*\" \"*\" \"*\" \"*\"\n7 ( 1 ) \"*\" \"*\" \"*\" \"*\" \"*\" \"*\" \"*\""
+ "section": "Same thing, different orders",
+ "text": "Same thing, different orders\n\n\nCode\narcuate %>% \n ggplot(aes(position, fa)) + \n geom_point(color = blue) + \n geom_smooth(aes(color = \"a\"), formula = y ~ poly(x, 4), method = \"lm\", se = FALSE) +\n geom_smooth(aes(color = \"b\"), formula = y ~ poly(x, 7), method = \"lm\", se = FALSE) +\n geom_smooth(aes(color = \"c\"), formula = y ~ poly(x, 25), method = \"lm\", se = FALSE) +\n scale_color_manual(name = \"Taylor order\",\n values = c(green, red, orange), labels = c(\"4 terms\", \"7 terms\", \"25 terms\"))"
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#bic-and-cp-2",
- "href": "schedule/slides/07-greedy-selection.html#bic-and-cp-2",
+ "objectID": "schedule/slides/10-basis-expansions.html#still-a-linear-smoother",
+ "href": "schedule/slides/10-basis-expansions.html#still-a-linear-smoother",
"title": "UBC Stat406 2023W",
- "section": "BIC and Cp",
- "text": "BIC and Cp\n\n\ntibble(\n BIC = summary(stepdown)$bic,\n Cp = summary(stepdown)$cp,\n size = 1:7\n) |>\n pivot_longer(-size) |>\n ggplot(aes(size, value, colour = name)) +\n geom_point() +\n geom_line() +\n facet_wrap(~name, scales = \"free_y\") +\n ylab(\"\") +\n scale_colour_manual(\n values = c(blue, orange), \n guide = \"none\"\n )"
+ "section": "Still a “linear smoother”",
+ "text": "Still a “linear smoother”\nReally, this is still linear regression, just in a transformed space.\nIt’s not linear in \\(x\\), but it is linear in \\((x,x^2,x^3)\\) (for the 3rd-order case)\nSo, we’re still doing OLS with\n\\[\\X=\\begin{bmatrix}1& x_1 & x_1^2 & x_1^3 \\\\ \\vdots&&&\\vdots\\\\1& x_n & x_n^2 & x_n^3\\end{bmatrix}\\]\nSo we can still use our nice formulas for LOO-CV, GCV, Cp, AIC, etc.\n\nmax_deg <- 20\ncv_nice <- function(mdl) mean( residuals(mdl)^2 / (1 - hatvalues(mdl))^2 ) \ncvscores <- map_dbl(seq_len(max_deg), ~ cv_nice(lm(fa ~ poly(position, .), data = arcuate)))"
},
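The hatvalues() shortcut inside cv_nice() can be checked against brute-force leave-one-out; a small sketch on simulated data (not the arcuate data):

set.seed(406)
n <- 50
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
fit <- lm(y ~ poly(x, 3))
loo_shortcut <- mean((residuals(fit) / (1 - hatvalues(fit)))^2)
loo_brute <- mean(sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ poly(x, 3), subset = -i)                 # refit without observation i
  (y[i] - predict(fit_i, newdata = data.frame(x = x[i])))^2
}))
all.equal(loo_shortcut, loo_brute)   # TRUE, up to numerical error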
{
- "objectID": "schedule/slides/07-greedy-selection.html#section",
- "href": "schedule/slides/07-greedy-selection.html#section",
+ "objectID": "schedule/slides/10-basis-expansions.html#section",
+ "href": "schedule/slides/10-basis-expansions.html#section",
"title": "UBC Stat406 2023W",
"section": "",
- "text": "somehow, for this seed, everything is the same"
+ "text": "Code\nlibrary(cowplot)\ng1 <- ggplot(tibble(cvscores, degrees = seq(max_deg)), aes(degrees, cvscores)) +\n geom_point(colour = blue) +\n geom_line(colour = blue) + \n labs(ylab = 'LOO-CV', xlab = 'polynomial degree') +\n geom_vline(xintercept = which.min(cvscores), linetype = \"dotted\") \ng2 <- ggplot(arcuate, aes(position, fa)) + \n geom_point(colour = blue) + \n geom_smooth(\n colour = orange, \n formula = y ~ poly(x, which.min(cvscores)), \n method = \"lm\", \n se = FALSE\n )\nplot_grid(g1, g2, ncol = 2)"
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#randomness-and-prediction-error",
- "href": "schedule/slides/07-greedy-selection.html#randomness-and-prediction-error",
+ "objectID": "schedule/slides/10-basis-expansions.html#other-bases",
+ "href": "schedule/slides/10-basis-expansions.html#other-bases",
"title": "UBC Stat406 2023W",
- "section": "Randomness and prediction error",
- "text": "Randomness and prediction error\nAll of that was for one data set.\nDoesn’t say which procedure is better generally.\nIf we want to know how they compare generally, we should repeat many times\n\nGenerate training data\nEstimate with different algorithms\nPredict held-out set data\nExamine prediction MSE (on held-out set)\n\n\nI’m not going to do all subsets, just the truth, forward selection, backward, and the full model\nFor forward/backward selection, I’ll use Cp to choose the final size"
+ "section": "Other bases",
+ "text": "Other bases\n\nPolynomials\n\n\\(x \\mapsto \\left(1,\\ x,\\ x^2, \\ldots, x^p\\right)\\) (technically, not quite this, they are orthogonalized)\n\nLinear splines\n\n\\(x \\mapsto \\bigg(1,\\ x,\\ (x-k_1)_+,\\ (x-k_2)_+,\\ldots, (x-k_p)_+\\bigg)\\) for some choices \\(\\{k_1,\\ldots,k_p\\}\\)\n\nCubic splines\n\n\\(x \\mapsto \\bigg(1,\\ x,\\ x^2,\\ x^3,\\ (x-k_1)^3_+,\\ (x-k_2)^3_+,\\ldots, (x-k_p)^3_+\\bigg)\\) for some choices \\(\\{k_1,\\ldots,k_p\\}\\)\n\nFourier series\n\n\\(x \\mapsto \\bigg(1,\\ \\cos(2\\pi x),\\ \\sin(2\\pi x),\\ \\cos(2\\pi 2 x),\\ \\sin(2\\pi 2 x), \\ldots, \\cos(2\\pi p x),\\ \\sin(2\\pi p x)\\bigg)\\)"
},
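The truncated-power versions of these bases are easy to build by hand; a rough sketch (knots chosen arbitrarily; R's poly() and splines::bs() use orthogonalized / B-spline versions instead):

trunc_pos <- function(u) pmax(u, 0)                               # (u)_+
linear_spline_basis <- function(x, knots) cbind(1, x, sapply(knots, function(k) trunc_pos(x - k)))
cubic_spline_basis <- function(x, knots) cbind(1, x, x^2, x^3, sapply(knots, function(k) trunc_pos(x - k)^3))
fourier_basis <- function(x, p) cbind(1, sapply(1:p, function(j) cos(2 * pi * j * x)),
                                      sapply(1:p, function(j) sin(2 * pi * j * x)))
x <- seq(0, 1, length.out = 101)
dim(linear_spline_basis(x, knots = c(.25, .5, .75)))   # 101 x 5
dim(cubic_spline_basis(x, knots = c(.25, .5, .75)))    # 101 x 7   (p + 4 with p = 3 knots)
dim(fourier_basis(x, p = 3))                           # 101 x 7   (2p + 1)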
{
- "objectID": "schedule/slides/07-greedy-selection.html#code-for-simulation",
- "href": "schedule/slides/07-greedy-selection.html#code-for-simulation",
+ "objectID": "schedule/slides/10-basis-expansions.html#how-do-you-choose",
+ "href": "schedule/slides/10-basis-expansions.html#how-do-you-choose",
"title": "UBC Stat406 2023W",
- "section": "Code for simulation",
- "text": "Code for simulation\n… Annoyingly, no predict method for regsubsets, so we make one.\n\npredict.regsubsets <- function(object, newdata, risk_estimate = c(\"cp\", \"bic\"), ...) {\n risk_estimate <- match.arg(risk_estimate)\n chosen <- coef(object, which.min(summary(object)[[risk_estimate]]))\n predictors <- names(chosen)\n if (object$intercept) predictors <- predictors[-1]\n X <- newdata[, predictors]\n if (object$intercept) X <- cbind2(1, X)\n drop(as.matrix(X) %*% chosen)\n}"
+ "section": "How do you choose?",
+ "text": "How do you choose?\nProcedure 1:\n\nPick your favorite basis. This is not as easy as it sounds. For instance, if \\(f\\) is a step function, linear splines will do well with good knots, but polynomials will be terrible unless you have lots of terms.\nPerform OLS on different orders.\nUse model selection criterion to choose the order.\n\nProcedure 2:\n\nUse a bunch of high-order bases, say Linear splines and Fourier series and whatever else you like.\nUse Lasso or Ridge regression or elastic net. (combining bases can lead to multicollinearity, but we may not care)\nUse model selection criteria to choose the tuning parameter."
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#section-1",
- "href": "schedule/slides/07-greedy-selection.html#section-1",
+ "objectID": "schedule/slides/10-basis-expansions.html#try-both-procedures",
+ "href": "schedule/slides/10-basis-expansions.html#try-both-procedures",
+ "title": "UBC Stat406 2023W",
+ "section": "Try both procedures",
+ "text": "Try both procedures\n\nSplit arcuate into 75% training data and 25% testing data.\nEstimate polynomials up to 20 as before and choose best order.\nDo ridge, lasso and elastic net \\(\\alpha=.5\\) on 20th order polynomials, B splines with 20 knots, and Fourier series with \\(p=20\\). Choose tuning parameter (using lambda.1se).\nRepeat 1-3 10 times (different splits)"
+ },
+ {
+ "objectID": "schedule/slides/10-basis-expansions.html#section-1",
+ "href": "schedule/slides/10-basis-expansions.html#section-1",
"title": "UBC Stat406 2023W",
"section": "",
- "text": "simulate_and_estimate_them_all <- function(n = 406) {\n N <- 2 * n # generate 2x the amount of data (half train, half test)\n df <- tibble( # generate data\n x1 = rnorm(N), \n x2 = rnorm(N, mean = 2), \n x3 = rexp(N),\n x4 = x2 + rnorm(N, sd = .1), \n x5 = x1 + rnorm(N, sd = .1),\n x6 = x1 - x2 + rnorm(N, sd = .1), \n x7 = x1 + x3 + rnorm(N, sd = .1),\n y = x1 * 3 + x2 / 3 + rnorm(N, sd = 2.2)\n )\n train <- df[1:n, ] # half the data for training\n test <- df[(n + 1):N, ] # half the data for evaluation\n \n oracle <- lm(y ~ x1 + x2 - 1, data = train) # knowing the right model, not the coefs\n full <- lm(y ~ ., data = train)\n stepup <- regsubsets(y ~ ., data = train, method = \"forward\")\n stepdown <- regsubsets(y ~ ., data = train, method = \"backward\")\n \n tibble(\n y = test$y,\n oracle = predict(oracle, newdata = test),\n full = predict(full, newdata = test),\n stepup = predict(stepup, newdata = test),\n stepdown = predict(stepdown, newdata = test),\n truth = drop(as.matrix(test[, c(\"x1\", \"x2\")]) %*% c(3, 1/3))\n )\n}\n\nset.seed(12345)\nour_sim <- map(1:50, ~ simulate_and_estimate_them_all(406)) |>\n list_rbind(names_to = \"sim\")"
+ "text": "library(glmnet)\nmapto01 <- function(x, pad = .005) (x - min(x) + pad) / (max(x) - min(x) + 2 * pad)\nx <- mapto01(arcuate$position)\nXmat <- cbind(\n poly(x, 20), \n splines::bs(x, df = 20), \n cos(2 * pi * outer(x, 1:20)), sin(2 * pi * outer(x, 1:20))\n)\ny <- arcuate$fa\nrmse <- function(z, s) sqrt(mean( (z - s)^2 ))\nnzero <- function(x) with(x, nzero[match(lambda.1se, lambda)])\nsim <- function(maxdeg = 20, train_frac = 0.75) {\n n <- nrow(arcuate)\n train <- as.logical(rbinom(n, 1, train_frac))\n test <- !train # not precisely 25%, but on average\n polycv <- map_dbl(seq(maxdeg), ~ cv_nice(lm(y ~ Xmat[,seq(.)], subset = train))) # figure out which order to use\n bpoly <- lm(y[train] ~ Xmat[train, seq(which.min(polycv))]) # now use it\n lasso <- cv.glmnet(Xmat[train, ], y[train])\n ridge <- cv.glmnet(Xmat[train, ], y[train], alpha = 0)\n elnet <- cv.glmnet(Xmat[train, ], y[train], alpha = .5)\n tibble(\n methods = c(\"poly\", \"lasso\", \"ridge\", \"elnet\"),\n rmses = c(\n rmse(y[test], cbind(1, Xmat[test, 1:which.min(polycv)]) %*% coef(bpoly)),\n rmse(y[test], predict(lasso, Xmat[test,])),\n rmse(y[test], predict(ridge, Xmat[test,])),\n rmse(y[test], predict(elnet, Xmat[test,]))\n ),\n nvars = c(which.min(polycv), nzero(lasso), nzero(ridge), nzero(elnet))\n )\n}\nset.seed(12345)\nsim_results <- map(seq(20), sim) |> list_rbind() # repeat it 20 times"
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#what-is-oracle",
- "href": "schedule/slides/07-greedy-selection.html#what-is-oracle",
+ "objectID": "schedule/slides/10-basis-expansions.html#section-2",
+ "href": "schedule/slides/10-basis-expansions.html#section-2",
"title": "UBC Stat406 2023W",
- "section": "What is “Oracle”",
- "text": "What is “Oracle”"
+ "section": "",
+ "text": "Code\nsim_results |> \n pivot_longer(-methods) |> \n ggplot(aes(methods, value, fill = methods)) + \n geom_boxplot() +\n facet_wrap(~ name, scales = \"free_y\") + \n ylab(\"\") +\n theme(legend.position = \"none\") + \n xlab(\"\") +\n scale_fill_viridis_d(begin = .2, end = 1)"
},
{
- "objectID": "schedule/slides/07-greedy-selection.html#results",
- "href": "schedule/slides/07-greedy-selection.html#results",
+ "objectID": "schedule/slides/10-basis-expansions.html#common-elements",
+ "href": "schedule/slides/10-basis-expansions.html#common-elements",
"title": "UBC Stat406 2023W",
- "section": "Results",
- "text": "Results\n\n\nour_sim |> \n group_by(sim) %>%\n summarise(\n across(oracle:truth, ~ mean((y - .)^2)), \n .groups = \"drop\"\n ) %>%\n transmute(across(oracle:stepdown, ~ . / truth - 1)) |> \n pivot_longer(\n everything(), \n names_to = \"method\", \n values_to = \"mse\"\n ) |> \n ggplot(aes(method, mse, fill = method)) +\n geom_boxplot(notch = TRUE) +\n geom_hline(yintercept = 0, linewidth = 2) +\n scale_fill_viridis_d() +\n theme(legend.position = \"none\") +\n scale_y_continuous(\n labels = scales::label_percent()\n ) +\n ylab(\"% increase in mse relative\\n to the truth\")"
+ "section": "Common elements",
+ "text": "Common elements\nIn all these cases, we transformed \\(x\\) to a higher-dimensional space\nUsed \\(p+1\\) dimensions with polynomials\nUsed \\(p+4\\) dimensions with cubic splines\nUsed \\(2p+1\\) dimensions with Fourier basis"
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#meta-lecture",
- "href": "schedule/slides/09-l1-penalties.html#meta-lecture",
+ "objectID": "schedule/slides/10-basis-expansions.html#featurization",
+ "href": "schedule/slides/10-basis-expansions.html#featurization",
"title": "UBC Stat406 2023W",
- "section": "09 L1 penalties",
- "text": "09 L1 penalties\nStat 406\nDaniel J. McDonald\nLast modified – 02 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Featurization",
+ "text": "Featurization\nEach case applied a feature map to \\(x\\), call it \\(\\Phi\\)\nWe used new “features” \\(\\Phi(x) = \\bigg(\\phi_1(x),\\ \\phi_2(x),\\ldots,\\phi_k(x)\\bigg)\\)\nNeural networks (coming in module 4) use this idea\nYou’ve also probably seen it in earlier courses when you added interaction terms or other transformations.\n\nSome methods (notably Support Vector Machines and Ridge regression) allow \\(k=\\infty\\)\nSee [ISLR] 9.3.2 for baby overview or [ESL] 5.8 (note 😱)"
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#last-time",
- "href": "schedule/slides/09-l1-penalties.html#last-time",
+ "objectID": "schedule/slides/12-why-smooth.html#meta-lecture",
+ "href": "schedule/slides/12-why-smooth.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Last time",
- "text": "Last time\n\nRidge regression\n\n\\(\\min \\frac{1}{n}\\snorm{\\y-\\X\\beta}_2^2 \\st \\snorm{\\beta}_2^2 \\leq s\\)\n\nBest (in sample) linear regression model of size \\(s\\)\n\n\\(\\min \\frac 1n \\snorm{\\y-\\X\\beta}_2^2 \\st \\snorm{\\beta}_0 \\leq s\\)\n\n\n\\(\\snorm{\\beta}_0\\) is the number of nonzero elements in \\(\\beta\\)\nFinding the “best” linear model (of size \\(s\\), among these predictors, in sample) is a nonconvex optimization problem (In fact, it is NP-hard)\nRidge regression is convex (easy to solve), but doesn’t do variable selection\nCan we somehow “interpolate” to get both?"
+ "section": "12 To(o) smooth or not to(o) smooth?",
+ "text": "12 To(o) smooth or not to(o) smooth?\nStat 406\nDaniel J. McDonald\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#geometry-of-convexity",
- "href": "schedule/slides/09-l1-penalties.html#geometry-of-convexity",
+ "objectID": "schedule/slides/12-why-smooth.html#last-time",
+ "href": "schedule/slides/12-why-smooth.html#last-time",
"title": "UBC Stat406 2023W",
- "section": "Geometry of convexity",
- "text": "Geometry of convexity\n\n\nCode\nlibrary(mvtnorm)\nnormBall <- function(q = 1, len = 1000) {\n tg <- seq(0, 2 * pi, length = len)\n out <- data.frame(x = cos(tg)) %>%\n mutate(b = (1 - abs(x)^q)^(1 / q), bm = -b) %>%\n gather(key = \"lab\", value = \"y\", -x)\n out$lab <- paste0('\"||\" * beta * \"||\"', \"[\", signif(q, 2), \"]\")\n return(out)\n}\n\nellipseData <- function(n = 100, xlim = c(-2, 3), ylim = c(-2, 3),\n mean = c(1, 1), Sigma = matrix(c(1, 0, 0, .5), 2)) {\n df <- expand.grid(\n x = seq(xlim[1], xlim[2], length.out = n),\n y = seq(ylim[1], ylim[2], length.out = n)\n )\n df$z <- dmvnorm(df, mean, Sigma)\n df\n}\n\nlballmax <- function(ed, q = 1, tol = 1e-6) {\n ed <- filter(ed, x > 0, y > 0)\n for (i in 1:20) {\n ff <- abs((ed$x^q + ed$y^q)^(1 / q) - 1) < tol\n if (sum(ff) > 0) break\n tol <- 2 * tol\n }\n best <- ed[ff, ]\n best[which.max(best$z), ]\n}\n\nnbs <- list()\nnbs[[1]] <- normBall(0, 1)\nqs <- c(.5, .75, 1, 1.5, 2)\nfor (ii in 2:6) nbs[[ii]] <- normBall(qs[ii - 1])\nnbs <- bind_rows(nbs)\nnbs$lab <- factor(nbs$lab, levels = unique(nbs$lab))\nseg <- data.frame(\n lab = levels(nbs$lab)[1],\n x0 = c(-1, 0), x1 = c(1, 0), y0 = c(0, -1), y1 = c(0, 1)\n)\nlevels(seg$lab) <- levels(nbs$lab)\nggplot(nbs, aes(x, y)) +\n geom_path(size = 1.2) +\n facet_wrap(~lab, labeller = label_parsed) +\n geom_segment(data = seg, aes(x = x0, xend = x1, y = y0, yend = y1), size = 1.2) +\n theme_bw(base_family = \"\", base_size = 24) +\n coord_equal() +\n scale_x_continuous(breaks = c(-1, 0, 1)) +\n scale_y_continuous(breaks = c(-1, 0, 1)) +\n geom_vline(xintercept = 0, size = .5) +\n geom_hline(yintercept = 0, size = .5) +\n xlab(bquote(beta[1])) +\n ylab(bquote(beta[2]))"
+ "section": "Last time…",
+ "text": "Last time…\nWe’ve been discussing smoothing methods in 1-dimension:\n\\[\\Expect{Y\\given X=x} = f(x),\\quad x\\in\\R\\]\nWe looked at basis expansions, e.g.:\n\\[f(x) \\approx \\beta_0 + \\beta_1 x + \\beta_2 x^2 + \\cdots + \\beta_k x^k\\]\nWe looked at local methods, e.g.:\n\\[f(x_i) \\approx s_i^\\top \\y\\]\n\nWhat if \\(x \\in \\R^p\\) and \\(p>1\\)?\n\n\n\nNote that \\(p\\) means the dimension of \\(x\\), not the dimension of the space of the polynomial basis or something else. That’s why I put \\(k\\) above."
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#the-best-of-both-worlds",
- "href": "schedule/slides/09-l1-penalties.html#the-best-of-both-worlds",
+ "objectID": "schedule/slides/12-why-smooth.html#kernels-and-interactions",
+ "href": "schedule/slides/12-why-smooth.html#kernels-and-interactions",
"title": "UBC Stat406 2023W",
- "section": "The best of both worlds",
- "text": "The best of both worlds\n\n\nCode\nnb <- normBall(1)\ned <- ellipseData()\nbols <- data.frame(x = 1, y = 1)\nbhat <- lballmax(ed, 1)\nggplot(nb, aes(x, y)) +\n geom_path(colour = red) +\n geom_contour(mapping = aes(z = z), colour = blue, data = ed, bins = 7) +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0) +\n geom_point(data = bols) +\n coord_equal(xlim = c(-2, 2), ylim = c(-2, 2)) +\n theme_bw(base_family = \"\", base_size = 24) +\n geom_label(\n data = bols, mapping = aes(label = bquote(\"hat(beta)[ols]\")), parse = TRUE,\n nudge_x = .3, nudge_y = .3\n ) +\n geom_point(data = bhat) +\n xlab(bquote(beta[1])) +\n ylab(bquote(beta[2])) +\n geom_label(\n data = bhat, mapping = aes(label = bquote(\"hat(beta)[s]^L\")), parse = TRUE,\n nudge_x = -.4, nudge_y = -.4\n )\n\n\n\nThis regularization set…\n\n… is convex (computationally efficient)\n… has corners (performs variable selection)"
+ "section": "Kernels and interactions",
+ "text": "Kernels and interactions\nIn multivariate nonparametric regression, you estimate a surface over the input variables.\nThis is trying to find \\(\\widehat{f}(x_1,\\ldots,x_p)\\).\nTherefore, this function by construction includes interactions, handles categorical data, etc. etc.\nThis is in contrast with explicit linear models which need you to specify these things.\nThis extra complexity (automatically including interactions, as well as other things) comes with tradeoffs.\n\nMore complicated functions (smooth Kernel regressions vs. linear models) tend to have lower bias but higher variance."
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#ell_1-regularized-regression",
- "href": "schedule/slides/09-l1-penalties.html#ell_1-regularized-regression",
+ "objectID": "schedule/slides/12-why-smooth.html#issue-1",
+ "href": "schedule/slides/12-why-smooth.html#issue-1",
"title": "UBC Stat406 2023W",
- "section": "\\(\\ell_1\\)-regularized regression",
- "text": "\\(\\ell_1\\)-regularized regression\nKnown as\n\n“lasso”\n“basis pursuit”\n\nThe estimator satisfies\n\\[\\blt = \\argmin_{ \\snorm{\\beta}_1 \\leq s} \\frac{1}{n}\\snorm{\\y-\\X\\beta}_2^2\\]\nIn its corresponding Lagrangian dual form:\n\\[\\bll = \\argmin_{\\beta} \\frac{1}{n}\\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_1\\]"
+ "section": "Issue 1",
+ "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\n\n\n\n\n\n\nImportant\n\n\nyou don’t need to memorize these formulas but you should know the intuition\nthe constants don’t matter for the intuition, but they matter for a particular data set. We don’t know them. So you estimate this."
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#lasso",
- "href": "schedule/slides/09-l1-penalties.html#lasso",
+ "objectID": "schedule/slides/12-why-smooth.html#issue-1-1",
+ "href": "schedule/slides/12-why-smooth.html#issue-1-1",
"title": "UBC Stat406 2023W",
- "section": "Lasso",
- "text": "Lasso\nWhile the ridge solution can be easily computed\n\\[\\brl = \\argmin_{\\beta} \\frac 1n \\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_2^2 = (\\X^{\\top}\\X + \\lambda \\mathbf{I})^{-1} \\X^{\\top}\\y\\]\nthe lasso solution\n\\[\\bll = \\argmin_{\\beta} \\frac 1n\\snorm{\\y-\\X\\beta}_2^2 + \\lambda \\snorm{\\beta}_1 = \\; ??\\]\ndoesn’t have a closed-form solution.\nHowever, because the optimization problem is convex, there exist efficient algorithms for computing it\n\n\nThe best are Iterative Soft Thresholding or Coordinate Descent. Gradient Descent doesn’t work very well in practice."
+ "section": "Issue 1",
+ "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nRecall, this decomposition is squared bias + variance + irreducible error\n\nIt depends on the choice of \\(h\\)\n\n\\[\\textrm{MSE}(\\hat{f}) = C_1 h^4 + \\frac{C_2}{nh} + \\sigma^2\\]\n\nUsing \\(h = cn^{-1/5}\\) balances squared bias and variance, leads to the above rate. (That balance minimizes the MSE)"
},
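Filling in the algebra behind that bandwidth choice (using only the displayed MSE expression):

\[
\frac{d}{dh}\left( C_1 h^4 + \frac{C_2}{nh} \right) = 4 C_1 h^3 - \frac{C_2}{n h^2} = 0
\quad\Longrightarrow\quad
h_{\mathrm{opt}} = \left( \frac{C_2}{4 C_1 n} \right)^{1/5} = c\, n^{-1/5},
\]

and substituting h_opt back in makes both terms of order n^{-4/5}, which is the rate quoted above.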
{
- "objectID": "schedule/slides/09-l1-penalties.html#coefficient-path-ridge-vs-lasso",
- "href": "schedule/slides/09-l1-penalties.html#coefficient-path-ridge-vs-lasso",
+ "objectID": "schedule/slides/12-why-smooth.html#issue-1-2",
+ "href": "schedule/slides/12-why-smooth.html#issue-1-2",
"title": "UBC Stat406 2023W",
- "section": "Coefficient path: ridge vs lasso",
- "text": "Coefficient path: ridge vs lasso\n\n\nCode\nlibrary(glmnet)\ndata(prostate, package = \"ElemStatLearn\")\nX <- prostate |> dplyr::select(-train, -lpsa) |> as.matrix()\nY <- prostate$lpsa\nlasso <- glmnet(x = X, y = Y) # alpha = 1 by default\nridge <- glmnet(x = X, y = Y, alpha = 0)\nop <- par()\n\n\n\npar(mfrow = c(1, 2), mar = c(5, 3, 5, .1))\nplot(lasso, main = \"Lasso\")\nplot(ridge, main = \"Ridge\")"
+ "section": "Issue 1",
+ "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nIntuition:\nas you collect data, use a smaller bandwidth and the MSE (on future data) decreases"
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#same-but-against-lambda",
- "href": "schedule/slides/09-l1-penalties.html#same-but-against-lambda",
+ "objectID": "schedule/slides/12-why-smooth.html#issue-1-3",
+ "href": "schedule/slides/12-why-smooth.html#issue-1-3",
"title": "UBC Stat406 2023W",
- "section": "Same but against Lambda",
- "text": "Same but against Lambda\n\npar(mfrow = c(1, 2), mar = c(5, 3, 5, .1))\nplot(lasso, main = \"Lasso\", xvar = \"lambda\")\nplot(ridge, main = \"Ridge\", xvar = \"lambda\")"
+ "section": "Issue 1",
+ "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nHow does this compare to just using a linear model?\nBias\n\nThe bias of using a linear model when the truth nonlinear is a number \\(b > 0\\) which doesn’t depend on \\(n\\).\nThe bias of using kernel regression is \\(C_1/n^{4/5}\\). This goes to 0 as \\(n\\rightarrow\\infty\\).\n\nVariance\n\nThe variance of using a linear model is \\(C/n\\) no matter what\nThe variance of using kernel regression is \\(C_2/n^{4/5}\\)."
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#additional-intuition-for-why-lasso-selects-variables",
- "href": "schedule/slides/09-l1-penalties.html#additional-intuition-for-why-lasso-selects-variables",
+ "objectID": "schedule/slides/12-why-smooth.html#issue-1-4",
+ "href": "schedule/slides/12-why-smooth.html#issue-1-4",
"title": "UBC Stat406 2023W",
- "section": "Additional intuition for why Lasso selects variables",
- "text": "Additional intuition for why Lasso selects variables\nSuppose, for a particular \\(\\lambda\\), I have solutions for \\(\\widehat{\\beta}_k\\), \\(k = 1,\\ldots,j-1, j+1,\\ldots,p\\).\nLet \\(\\widehat{\\y}_{-j} = \\X_{-j}\\widehat{\\beta}_{-j}\\), and assume WLOG \\(\\overline{\\X}_k = 0\\), \\(\\X_k^\\top\\X_k = 1\\ \\forall k\\)\nOne can show that:\n\\[\n\\widehat{\\beta}_j = S\\left(\\mathbf{X}^\\top_j(\\y - \\widehat{\\y}_{-j}),\\ \\lambda\\right).\n\\]\n\\[\nS(z, \\gamma) = \\textrm{sign}(z)(|z| - \\gamma)_+ = \\begin{cases} z - \\gamma & z > \\gamma\\\\\nz + \\gamma & z < -\\gamma \\\\ 0 & |z| \\leq \\gamma \\end{cases}\n\\]\n\nIterating over this is called coordinate descent and gives the solution\n\n\n\n\nIf I were told all the other coefficient estimates.\nThen to find this one, I’d shrink when the gradient is big, or set to 0 if it gets too small.\n\n\n\nSee for example, https://doi.org/10.18637/jss.v033.i01"
+ "section": "Issue 1",
+ "text": "Issue 1\nFor \\(p=1\\), one can show that for kernels (with the correct bandwidth)\n\\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nTo conclude:\n\nbias of kernels goes to zero, bias of lines doesn’t (unless the truth is linear).\nbut variance of lines goes to zero faster than for kernels.\n\nIf the linear model is right, you win.\nBut if it’s wrong, you (eventually) lose as \\(n\\) grows.\nHow do you know if you have enough data?\nCompare of the kernel version with CV-selected tuning parameter with the estimate of the risk for the linear model."
},
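One way to carry out that comparison in R, sketched on simulated data (the base-R smoother ksmooth(), the bandwidth grid, and the data-generating function are illustrative choices, not from the slides): estimate out-of-sample risk for both by CV and keep whichever estimate is smaller.

set.seed(406)
n <- 300
x <- sort(runif(n))
y <- sin(4 * x) + rnorm(n, sd = 0.3)
folds <- sample(rep(1:5, length.out = n))
cv_mse <- function(predict_fun) {                 # 5-fold CV risk estimate
  mean(sapply(1:5, function(k) {
    tr <- folds != k
    mean((y[!tr] - predict_fun(x[tr], y[tr], x[!tr]))^2)
  }))
}
lin_risk <- cv_mse(function(xt, yt, x0) predict(lm(yt ~ xt), newdata = data.frame(xt = x0)))
ker_risk <- min(sapply(seq(0.1, 0.5, by = 0.1), function(h) {   # CV-selected bandwidth
  cv_mse(function(xt, yt, x0) ksmooth(xt, yt, "normal", bandwidth = h, x.points = x0)$y)
}))
c(linear = lin_risk, kernel = ker_risk)           # use whichever risk estimate is smaller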
{
- "objectID": "schedule/slides/09-l1-penalties.html#packages",
- "href": "schedule/slides/09-l1-penalties.html#packages",
+ "objectID": "schedule/slides/12-why-smooth.html#issue-2",
+ "href": "schedule/slides/12-why-smooth.html#issue-2",
"title": "UBC Stat406 2023W",
- "section": "Packages",
- "text": "Packages\nThere are two main R implementations for finding lasso\n{glmnet}: lasso = glmnet(X, Y, alpha=1).\n\nSetting alpha = 0 gives ridge regression (as does lm.ridge in the MASS package)\nSetting alpha \\(\\in (0,1)\\) gives a method called the “elastic net” which combines ridge regression and lasso, more on that next lecture.\nIf you don’t specify alpha, it does lasso\n\n{lars}: lars = lars(X, Y)\n\nlars() also does other things called “Least angle” and “forward stagewise” in addition to “forward stepwise” regression\nThe path returned by lars() is more useful than that returned by glmnet().\n\n\nBut you should use {glmnet}."
+ "section": "Issue 2",
+ "text": "Issue 2\nFor \\(p>1\\), there is more trouble.\nFirst, lets look again at \\[\\textrm{MSE}(\\hat{f}) = \\frac{C_1}{n^{4/5}} + \\frac{C_2}{n^{4/5}} + \\sigma^2\\]\nThat is for \\(p=1\\). It’s not that much slower than \\(C/n\\), the variance for linear models.\nIf \\(p>1\\) similar calculations show,\n\\[\\textrm{MSE}(\\hat f) = \\frac{C_1+C_2}{n^{4/(4+p)}} + \\sigma^2 \\hspace{2em} \\textrm{MSE}(\\hat \\beta) = b + \\frac{Cp}{n} + \\sigma^2 .\\]"
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#choosing-the-lambda",
- "href": "schedule/slides/09-l1-penalties.html#choosing-the-lambda",
+ "objectID": "schedule/slides/12-why-smooth.html#issue-2-1",
+ "href": "schedule/slides/12-why-smooth.html#issue-2-1",
"title": "UBC Stat406 2023W",
- "section": "Choosing the \\(\\lambda\\)",
- "text": "Choosing the \\(\\lambda\\)\nYou have to choose \\(\\lambda\\) in lasso or in ridge regression\nlasso selects variables (by setting coefficients to zero), but the value of \\(\\lambda\\) determines how many/which.\nAll of these packages come with CV built in.\nHowever, the way to do it differs from package to package"
+ "section": "Issue 2",
+ "text": "Issue 2\n\\[\\textrm{MSE}(\\hat f) = \\frac{C_1+C_2}{n^{4/(4+p)}} + \\sigma^2 \\hspace{2em} \\textrm{MSE}(\\hat \\beta) = b + \\frac{Cp}{n} + \\sigma^2 .\\]\nWhat if \\(p\\) is big (and \\(n\\) is really big)?\n\nThen \\((C_1 + C_2) / n^{4/(4+p)}\\) is still big.\nBut \\(Cp / n\\) is small.\nSo unless \\(b\\) is big, we should use the linear model.\n\nHow do you tell? Do model selection to decide.\nA very, very questionable rule of thumb: if \\(p>\\log(n)\\), don’t do smoothing."
},
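A quick numerical illustration of these two rates (constants ignored, so this is only about orders of magnitude):

n <- 1e6
p <- c(1, 2, 5, 10, 20)
rbind(p = p,
      kernel_rate = signif(n^(-4 / (4 + p)), 2),   # ~ (C1 + C2) / n^{4/(4+p)}
      linear_rate = signif(p / n, 2))              # ~ Cp / n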
{
- "objectID": "schedule/slides/09-l1-penalties.html#glmnet-version-same-procedure-for-lasso-or-ridge",
- "href": "schedule/slides/09-l1-penalties.html#glmnet-version-same-procedure-for-lasso-or-ridge",
+ "objectID": "schedule/slides/14-classification-intro.html#meta-lecture",
+ "href": "schedule/slides/14-classification-intro.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "{glmnet} version (same procedure for lasso or ridge)",
- "text": "{glmnet} version (same procedure for lasso or ridge)\n\nlasso <- cv.glmnet(X, Y) # estimate full model and CV no good reason to call glmnet() itself\n# 2. Look at the CV curve. If the dashed lines are at the boundaries, redo and adjust lambda\nlambda_min <- lasso$lambda.min # the value, not the location (or use lasso$lambda.1se)\ncoeffs <- coefficients(lasso, s = \"lambda.min\") # s can be string or a number\npreds <- predict(lasso, newx = X, s = \"lambda.1se\") # must supply `newx`\n\n\n\\(\\widehat{R}_{CV}\\) is an estimator of \\(R_n\\), it has bias and variance\nBecause we did CV, we actually have 10 \\(\\widehat{R}\\) values, 1 per split.\nCalculate the mean (that’s what we’ve been using), but what about SE?"
+ "section": "14 Classification",
+ "text": "14 Classification\nStat 406\nDaniel J. McDonald\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#section",
- "href": "schedule/slides/09-l1-penalties.html#section",
+ "objectID": "schedule/slides/14-classification-intro.html#an-overview-of-classification",
+ "href": "schedule/slides/14-classification-intro.html#an-overview-of-classification",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "par(mfrow = c(1, 2), mar = c(5, 3, 3, 0))\nplot(lasso) # a plot method for the cv fit\nplot(lasso$glmnet.fit) # the glmnet.fit == glmnet(X,Y)\nabline(v = colSums(abs(coef(lasso$glmnet.fit)[-1, drop(lasso$index)])), lty = 2)"
+ "section": "An Overview of Classification",
+ "text": "An Overview of Classification\n\nA person arrives at an emergency room with a set of symptoms that could be 1 of 3 possible conditions. Which one is it?\nAn online banking service must be able to determine whether each transaction is fraudulent or not, using a customer’s location, past transaction history, etc.\nGiven a set of individuals sequenced DNA, can we determine whether various mutations are associated with different phenotypes?\n\n\nThese problems are not regression problems. They are classification problems."
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#paths-with-chosen-lambda",
- "href": "schedule/slides/09-l1-penalties.html#paths-with-chosen-lambda",
+ "objectID": "schedule/slides/14-classification-intro.html#the-set-up",
+ "href": "schedule/slides/14-classification-intro.html#the-set-up",
"title": "UBC Stat406 2023W",
- "section": "Paths with chosen lambda",
- "text": "Paths with chosen lambda\n\nridge <- cv.glmnet(X, Y, alpha = 0, lambda.min.ratio = 1e-10) # added to get a minimum\npar(mfrow = c(1, 4))\nplot(ridge, main = \"Ridge\")\nplot(lasso, main = \"Lasso\")\nplot(ridge$glmnet.fit, main = \"Ridge\")\nabline(v = sum(abs(coef(ridge)))) # defaults to `lambda.1se`\nplot(lasso$glmnet.fit, main = \"Lasso\")\nabline(v = sum(abs(coef(lasso)))) # again, `lambda.1se` unless told otherwise"
+ "section": "The Set-up",
+ "text": "The Set-up\nIt begins just like regression: suppose we have observations \\[\\{(x_1,y_1),\\ldots,(x_n,y_n)\\}\\]\nAgain, we want to estimate a function that maps \\(X\\) to \\(Y\\) to predict as yet observed data.\n(This function is known as a classifier)\nThe same constraints apply:\n\nWe want a classifier that predicts test data, not just the training data.\nOften, this comes with the introduction of some bias to get lower variance and better predictions."
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#degrees-of-freedom",
- "href": "schedule/slides/09-l1-penalties.html#degrees-of-freedom",
+ "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality",
+ "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality",
"title": "UBC Stat406 2023W",
- "section": "Degrees of freedom",
- "text": "Degrees of freedom\nLasso is not a linear smoother. There is no matrix \\(S\\) such that \\(\\widehat{\\y} = \\mathbf{S}\\y\\) for the predicted values from lasso.\n\nWe can’t use cv_nice().\nWe don’t have \\(\\tr{\\mathbf{S}} = \\textrm{df}\\) because there is no \\(\\mathbf{S}\\).\n\nHowever,\n\nOne can show that \\(\\textrm{df}_\\lambda = \\E[\\#(\\widehat{\\beta}_\\lambda \\neq 0)] = \\E[||\\widehat{\\beta}_\\lambda||_0]\\)\nThe proof is PhD-level material\n\nNote that the \\(\\widehat{\\textrm{df}}_\\lambda\\) is shown on the CV plot and that lasso.glmnet$glmnet.fit$df contains this value for all \\(\\lambda\\)."
+ "section": "How do we measure quality?",
+ "text": "How do we measure quality?\nBefore in regression, we have \\(y_i \\in \\mathbb{R}\\) and use squared error loss to measure accuracy: \\((y - \\hat{y})^2\\).\nInstead, let \\(y \\in \\mathcal{K} = \\{1,\\ldots, K\\}\\)\n(This is arbitrary, sometimes other numbers, such as \\(\\{-1,1\\}\\) will be used)\nWe can always take “factors”: \\(\\{\\textrm{cat},\\textrm{dog}\\}\\) and convert to integers, which is what we assume.\nWe again make predictions \\(\\hat{y}=k\\) based on the data\n\nWe get zero loss if we predict the right class\nWe lose \\(\\ell(k,k')\\) on \\((k\\neq k')\\) for incorrect predictions"
},
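A tiny illustration of the factor-to-integer convention and the zero-one loss (made-up labels):

y <- factor(c("cat", "dog", "dog", "cat"))
yhat <- factor(c("cat", "cat", "dog", "dog"))
k <- as.integer(y)        # cat -> 1, dog -> 2
khat <- as.integer(yhat)
mean(khat != k)           # average zero-one loss = 0.5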
{
- "objectID": "schedule/slides/09-l1-penalties.html#other-flavours",
- "href": "schedule/slides/09-l1-penalties.html#other-flavours",
+ "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-1",
+ "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-1",
"title": "UBC Stat406 2023W",
- "section": "Other flavours",
- "text": "Other flavours\n\nThe elastic net\n\ngenerally used for correlated variables that combines a ridge/lasso penalty. Use glmnet(..., alpha = a) (0 < a < 1).\n\nGrouped lasso\n\nwhere variables are included or excluded in groups. Required for factors (1-hot encoding)\n\nRelaxed lasso\n\nTakes the estimated model from lasso and fits the full least squares solution on the selected covariates (less bias, more variance). Use glmnet(..., relax = TRUE).\n\nDantzig selector\n\na slightly modified version of the lasso"
+ "section": "How do we measure quality?",
+ "text": "How do we measure quality?\nSuppose you have a fever of 39º C. You get a rapid test on campus.\n\n\n\nLoss\nTest +\nTest -\n\n\n\n\nAre +\n0\nInfect others\n\n\nAre -\nIsolation\n0"
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#lasso-cinematic-universe",
- "href": "schedule/slides/09-l1-penalties.html#lasso-cinematic-universe",
+ "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-2",
+ "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-2",
"title": "UBC Stat406 2023W",
- "section": "Lasso cinematic universe",
- "text": "Lasso cinematic universe\n\n\n\nSCAD\n\na non-convex version of lasso that adds a more severe variable selection penalty\n\n\\(\\sqrt{\\textrm{lasso}}\\)\n\nclaims to be tuning parameter free (but isn’t). Uses \\(\\Vert\\cdot\\Vert_2\\) instead of \\(\\Vert\\cdot\\Vert_1\\) for the loss.\n\nGeneralized lasso\n\nAdds various additional matrices to the penalty term (e.g. \\(\\Vert D\\beta\\Vert_1\\)).\n\nArbitrary combinations\n\ncombine the above penalties in your favourite combinations"
+ "section": "How do we measure quality?",
+ "text": "How do we measure quality?\nSuppose you have a fever of 39º C. You get a rapid test on campus.\n\n\n\nLoss\nTest +\nTest -\n\n\n\n\nAre +\n0\n1\n\n\nAre -\n1\n0"
},
{
- "objectID": "schedule/slides/09-l1-penalties.html#warnings-on-regularized-regression",
- "href": "schedule/slides/09-l1-penalties.html#warnings-on-regularized-regression",
+ "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-3",
+ "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-3",
"title": "UBC Stat406 2023W",
- "section": "Warnings on regularized regression",
- "text": "Warnings on regularized regression\n\nThis isn’t a method unless you say how to choose \\(\\lambda\\).\nThe intercept is never penalized. Adds an extra degree-of-freedom.\nPredictor scaling is very important.\nDiscrete predictors need groupings.\nCentering the predictors is important\n(These all work with other likelihoods.)\n\n\nSoftware handles most of these automatically, but not always. (No Lasso with factor predictors.)"
+ "section": "How do we measure quality?",
+ "text": "How do we measure quality?\n\nWe’re going to use \\(g(x)\\) to be our classifier. It takes values in \\(\\mathcal{K}\\)."
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#meta-lecture",
- "href": "schedule/slides/11-kernel-smoothers.html#meta-lecture",
+ "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-4",
+ "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-4",
"title": "UBC Stat406 2023W",
- "section": "11 Local methods",
- "text": "11 Local methods\nStat 406\nDaniel J. McDonald\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "How do we measure quality?",
+ "text": "How do we measure quality?\nAgain, we appeal to risk \\[R_n(g) = E [\\ell(Y,g(X))]\\] If we use the law of total probability, this can be written \\[R_n(g) = E_X \\sum_{y=1}^K \\ell(y,\\; g(X)) Pr(Y = y \\given X)\\] We minimize this over a class of options \\(\\mathcal{G}\\), to produce \\[g_*(X) = \\argmin_{g\\in\\mathcal{G}} E_X \\sum_{y=1}^K \\ell(y,g(X)) Pr(Y = y \\given X)\\]"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#last-time",
- "href": "schedule/slides/11-kernel-smoothers.html#last-time",
+ "objectID": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-5",
+ "href": "schedule/slides/14-classification-intro.html#how-do-we-measure-quality-5",
"title": "UBC Stat406 2023W",
- "section": "Last time…",
- "text": "Last time…\nWe looked at feature maps as a way to do nonlinear regression.\nWe used new “features” \\(\\Phi(x) = \\bigg(\\phi_1(x),\\ \\phi_2(x),\\ldots,\\phi_k(x)\\bigg)\\)\nNow we examine an alternative\nSuppose I just look at the “neighbours” of some point (based on the \\(x\\)-values)\nI just average the \\(y\\)’s at those locations together"
+ "section": "How do we measure quality?",
+ "text": "How do we measure quality?\n\\(g_*\\) is named the Bayes’ classifier for loss \\(\\ell\\) in class \\(\\mathcal{G}\\).\n\\(R_n(g_*)\\) is the called the Bayes’ limit or Bayes’ Risk.\nIt’s the best we could hope to do in terms of \\(\\ell\\) if we knew the distribution of the data.\n\nBut we don’t, so we’ll try to do our best to estimate \\(g_*\\)."
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#lets-use-3-neighbours",
- "href": "schedule/slides/11-kernel-smoothers.html#lets-use-3-neighbours",
+ "objectID": "schedule/slides/14-classification-intro.html#best-classifier-overall",
+ "href": "schedule/slides/14-classification-intro.html#best-classifier-overall",
"title": "UBC Stat406 2023W",
- "section": "Let’s use 3 neighbours",
- "text": "Let’s use 3 neighbours\n\n\nCode\nlibrary(cowplot)\ndata(arcuate, package = \"Stat406\")\nset.seed(406406)\narcuate_unif <- arcuate |> slice_sample(n = 40) |> arrange(position)\npt <- 15\nnn <- 3\nseq_range <- function(x, n = 101) seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = n)\nneibs <- sort.int(abs(arcuate_unif$position - arcuate_unif$position[pt]), index.return = TRUE)$ix[1:nn]\narcuate_unif$neighbours = seq_len(40) %in% neibs\ng1 <- ggplot(arcuate_unif, aes(position, fa, colour = neighbours)) + \n geom_point() +\n scale_colour_manual(values = c(blue, red)) + \n geom_vline(xintercept = arcuate_unif$position[pt], colour = red) + \n annotate(\"rect\", fill = red, alpha = .25, ymin = -Inf, ymax = Inf,\n xmin = min(arcuate_unif$position[neibs]), \n xmax = max(arcuate_unif$position[neibs])\n ) +\n theme(legend.position = \"none\")\ng2 <- ggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(\n data = tibble(\n position = seq_range(arcuate_unif$position),\n fa = FNN::knn.reg(\n arcuate_unif$position, matrix(position, ncol = 1),\n y = arcuate_unif$fa\n )$pred\n ),\n colour = orange, linewidth = 2\n )\nplot_grid(g1, g2, ncol = 2)"
+ "section": "Best classifier overall",
+ "text": "Best classifier overall\n(for now, we limit to 2 classes)\nOnce we make a specific choice for \\(\\ell\\), we can find \\(g_*\\) exactly (pretending we know the distribution)\nBecause \\(Y\\) takes only a few values, zero-one loss is natural (but not the only option) \\[\\ell(y,\\ g(x)) = \\begin{cases}0 & y=g(x)\\\\1 & y\\neq g(x) \\end{cases} \\Longrightarrow R_n(g) = \\Expect{\\ell(Y,\\ g(X))} = Pr(g(X) \\neq Y),\\]"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#knn",
- "href": "schedule/slides/11-kernel-smoothers.html#knn",
+ "objectID": "schedule/slides/14-classification-intro.html#best-classifier-overall-1",
+ "href": "schedule/slides/14-classification-intro.html#best-classifier-overall-1",
"title": "UBC Stat406 2023W",
- "section": "KNN",
- "text": "KNN\n\n\n\n\n\n\n\n\n\n\n\n\n\n\ndata(arcuate, package = \"Stat406\")\nlibrary(FNN)\narcuate_unif <- arcuate |> \n slice_sample(n = 40) |> \n arrange(position) \n\nnew_position <- seq(\n min(arcuate_unif$position), \n max(arcuate_unif$position),\n length.out = 101\n)\n\nknn3 <- knn.reg(\n train = arcuate_unif$position, \n test = matrix(arcuate_unif$position, ncol = 1), \n y = arcuate_unif$fa, \n k = 3\n)"
+ "section": "Best classifier overall",
+ "text": "Best classifier overall\n\n\n\nLoss\nTest +\nTest -\n\n\n\n\nAre +\n0\n1\n\n\nAre -\n1\n0"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#this-method-is-k-nearest-neighbours.",
- "href": "schedule/slides/11-kernel-smoothers.html#this-method-is-k-nearest-neighbours.",
+ "objectID": "schedule/slides/14-classification-intro.html#best-classifier-overall-2",
+ "href": "schedule/slides/14-classification-intro.html#best-classifier-overall-2",
"title": "UBC Stat406 2023W",
- "section": "This method is \\(K\\)-nearest neighbours.",
- "text": "This method is \\(K\\)-nearest neighbours.\nIt’s a linear smoother just like in previous lectures: \\(\\widehat{\\mathbf{y}} = \\mathbf{S} \\mathbf{y}\\) for some matrix \\(S\\).\nYou should imagine what \\(\\mathbf{S}\\) looks like.\nWhat is the degrees of freedom of KNN?\nKNN averages the neighbours with equal weight.\nBut some neighbours are “closer” than other neighbours."
+ "section": "Best classifier overall",
+ "text": "Best classifier overall\nThis means we want to classify a new observation \\((x_0,y_0)\\) such that \\(g(x_0) = y_0\\) as often as possible\nUnder this loss, we have \\[\n\\begin{aligned}\ng_*(X) &= \\argmin_{g} Pr(g(X) \\neq Y) \\\\\n&= \\argmin_{g} \\left[ 1 - Pr(Y = g(x) | X=x)\\right] \\\\\n&= \\argmax_{g} Pr(Y = g(x) | X=x )\n\\end{aligned}\n\\]"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#local-averages",
- "href": "schedule/slides/11-kernel-smoothers.html#local-averages",
+ "objectID": "schedule/slides/14-classification-intro.html#estimating-g_",
+ "href": "schedule/slides/14-classification-intro.html#estimating-g_",
"title": "UBC Stat406 2023W",
- "section": "Local averages",
- "text": "Local averages\nInstead of choosing the number of neighbours to average, we can average any observations within a certain distance.\n\n\nThe boxes have width 30."
+ "section": "Estimating \\(g_*\\)",
+ "text": "Estimating \\(g_*\\)\nClassifier approach 1 (empirical risk minimization):\n\nChoose some class of classifiers \\(\\mathcal{G}\\).\nFind \\(\\argmin_{g\\in\\mathcal{G}} \\sum_{i = 1}^n I(g(x_i) \\neq y_i)\\)"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#what-is-a-kernel-smoother",
- "href": "schedule/slides/11-kernel-smoothers.html#what-is-a-kernel-smoother",
+ "objectID": "schedule/slides/14-classification-intro.html#bayes-classifier-and-class-densities-2-classes",
+ "href": "schedule/slides/14-classification-intro.html#bayes-classifier-and-class-densities-2-classes",
"title": "UBC Stat406 2023W",
- "section": "What is a “kernel” smoother?",
- "text": "What is a “kernel” smoother?\n\nThe mathematics:\n\n\nA kernel is any function \\(K\\) such that for any \\(u\\), \\(K(u) \\geq 0\\), \\(\\int du K(u)=1\\) and \\(\\int uK(u)du=0\\).\n\n\nThe idea: a kernel is a nice way to take weighted averages. The kernel function gives the weights.\nThe previous example is called the boxcar kernel."
+ "section": "Bayes’ Classifier and class densities (2 classes)",
+ "text": "Bayes’ Classifier and class densities (2 classes)\nUsing Bayes’ theorem, and recalling that \\(f_*(X) = E[Y \\given X]\\)\n\\[\\begin{aligned}\nf_*(X) & = E[Y \\given X] = Pr(Y = 1 \\given X) \\\\\n&= \\frac{Pr(X\\given Y=1) Pr(Y=1)}{Pr(X)}\\\\\n& =\\frac{Pr(X\\given Y = 1) Pr(Y = 1)}{\\sum_{k \\in \\{0,1\\}} Pr(X\\given Y = k) Pr(Y = k)} \\\\ & = \\frac{p_1(X) \\pi}{ p_1(X)\\pi + p_0(X)(1-\\pi)}\\end{aligned}\\]\n\nWe call \\(p_k(X)\\) the class (conditional) densities\n\\(\\pi\\) is the marginal probability \\(P(Y=1)\\)"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#smoothing-with-the-boxcar",
- "href": "schedule/slides/11-kernel-smoothers.html#smoothing-with-the-boxcar",
+ "objectID": "schedule/slides/14-classification-intro.html#bayes-classifier-and-class-densities-2-classes-1",
+ "href": "schedule/slides/14-classification-intro.html#bayes-classifier-and-class-densities-2-classes-1",
"title": "UBC Stat406 2023W",
- "section": "Smoothing with the boxcar",
- "text": "Smoothing with the boxcar\n\n\nCode\ntestpts <- seq(0, 200, length.out = 101)\ndmat <- abs(outer(testpts, arcuate_unif$position, \"-\"))\nS <- (dmat < 15)\nS <- S / rowSums(S)\nboxcar <- tibble(position = testpts, fa = S %*% arcuate_unif$fa)\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(data = boxcar, colour = orange)\n\n\n\nThis one gives the same non-zero weight to all points within \\(\\pm 15\\) range."
+ "section": "Bayes’ Classifier and class densities (2 classes)",
+ "text": "Bayes’ Classifier and class densities (2 classes)\nThe Bayes’ Classifier (best classifier for 0-1 loss) can be rewritten\n\\[g_*(X) = \\begin{cases}\n1 & \\textrm{ if } \\frac{p_1(X)}{p_0(X)} > \\frac{1-\\pi}{\\pi} \\\\\n0 & \\textrm{ otherwise}\n\\end{cases}\\]\nApproach 2: estimate everything in the expression above.\n\nWe need to estimate \\(p_1\\), \\(p_2\\), \\(\\pi\\), \\(1-\\pi\\)\nEasily extended to more than two classes"
},
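A toy plug-in version of that rule (a sketch with Gaussian class-density estimates on made-up data; not a method developed on these slides):

set.seed(406)
n <- 200
y <- rbinom(n, 1, 0.4)                      # pi = Pr(Y = 1) = 0.4
x <- rnorm(n, mean = ifelse(y == 1, 2, 0))  # class-conditional densities p_1, p_0
pi_hat <- mean(y)
mu1 <- mean(x[y == 1]); s1 <- sd(x[y == 1])
mu0 <- mean(x[y == 0]); s0 <- sd(x[y == 0])
g_hat <- function(x0) {
  ratio <- dnorm(x0, mu1, s1) / dnorm(x0, mu0, s0)   # estimated p_1(x) / p_0(x)
  as.integer(ratio > (1 - pi_hat) / pi_hat)
}
mean(g_hat(x) != y)                         # training misclassification rate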
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#other-kernels",
- "href": "schedule/slides/11-kernel-smoothers.html#other-kernels",
+ "objectID": "schedule/slides/14-classification-intro.html#an-alternative-easy-classifier",
+ "href": "schedule/slides/14-classification-intro.html#an-alternative-easy-classifier",
"title": "UBC Stat406 2023W",
- "section": "Other kernels",
- "text": "Other kernels\nMost of the time, we don’t use the boxcar because the weights are weird. (constant)\nA more common one is the Gaussian kernel:\n\n\nCode\ngaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 7.5) * 3\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +\n stat_function(fun = gaussian_kernel, geom = \"area\", fill = orange)\n\n\n\nFor the plot, I made \\(\\sigma=7.5\\).\nNow the weights “die away” for points farther from where we’re predicting. (but all nonzero!!)"
+ "section": "An alternative easy classifier",
+ "text": "An alternative easy classifier\nZero-One loss was natural, but try something else\nLet’s try using squared error loss instead: \\(\\ell(y,\\ f(x)) = (y - f(x))^2\\)\nThen, the Bayes’ Classifier (the function that minimizes the Bayes Risk) is \\[g_*(x) = f_*(x) = E[ Y \\given X = x] = Pr(Y = 1 \\given X)\\] (recall that \\(f_* \\in [0,1]\\) is still the regression function)\nIn this case, our “class” will actually just be a probability. But this isn’t a class, so it’s a bit unsatisfying.\nHow do we get a class prediction?\n\nDiscretize the probability:\n\\[g(x) = \\begin{cases}0 & f_*(x) < 1/2\\\\1 & \\textrm{else}\\end{cases}\\]"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#other-kernels-1",
- "href": "schedule/slides/11-kernel-smoothers.html#other-kernels-1",
+ "objectID": "schedule/slides/14-classification-intro.html#estimating-g_-1",
+ "href": "schedule/slides/14-classification-intro.html#estimating-g_-1",
"title": "UBC Stat406 2023W",
- "section": "Other kernels",
- "text": "Other kernels\nWhat if I made \\(\\sigma=15\\)?\n\n\nCode\ngaussian_kernel <- function(x) dnorm(x, mean = arcuate_unif$position[15], sd = 15) * 3\nggplot(arcuate_unif, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_segment(aes(x = position[15], y = 0, xend = position[15], yend = fa[15]), colour = orange) +\n stat_function(fun = gaussian_kernel, geom = \"area\", fill = orange)\n\n\n\nBefore, points far from \\(x_{15}\\) got very small weights, now they have more influence.\nFor the Gaussian kernel, \\(\\sigma\\) determines something like the “range” of the smoother."
+ "section": "Estimating \\(g_*\\)",
+ "text": "Estimating \\(g_*\\)\nApproach 3:\n\nEstimate \\(f_*\\) using any method we’ve learned so far.\nPredict 0 if \\(\\hat{f}(x)\\) is less than 1/2, else predict 1."
},
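A short sketch of Approach 3 with a toy linear fit (simulated data, hypothetical names); any regression method from earlier modules could replace lm() here.

```r
# Approach 3: estimate f_* with any regression method, then threshold at 1/2.
set.seed(2)
x <- runif(100)
y <- rbinom(100, 1, plogis(4 * (x - 0.5)))  # toy 0/1 response
fhat <- predict(lm(y ~ x))                  # any estimate of f_* would do
ghat <- as.integer(fhat >= 0.5)             # predict 1 when the estimate >= 1/2
mean(ghat != y)                             # training misclassification rate
```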
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#many-gaussians",
- "href": "schedule/slides/11-kernel-smoothers.html#many-gaussians",
+ "objectID": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression",
+ "href": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression",
"title": "UBC Stat406 2023W",
- "section": "Many Gaussians",
- "text": "Many Gaussians\nThe following code creates \\(\\mathbf{S}\\) for Gaussian kernel smoothers with different \\(\\sigma\\)\n\ndmat <- as.matrix(dist(x))\nSgauss <- function(sigma) {\n gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat\n sweep(gg, 1, rowSums(gg), \"/\") # make the rows sum to 1.\n}\n\n\n\nCode\nSgauss <- function(sigma) {\n gg <- dnorm(dmat, sd = sigma) # not an argument, uses the global dmat\n sweep(gg, 1, rowSums(gg),'/') # make the rows sum to 1.\n}\nboxcar$S15 = with(arcuate_unif, Sgauss(15) %*% fa)\nboxcar$S08 = with(arcuate_unif, Sgauss(8) %*% fa)\nboxcar$S30 = with(arcuate_unif, Sgauss(30) %*% fa)\nbc = boxcar %>% select(position, S15, S08, S30) %>% \n pivot_longer(-position, names_to = \"Sigma\")\nggplot(arcuate_unif, aes(position, fa)) + \n geom_point(colour = blue) + \n geom_line(data = bc, aes(position, value, colour = Sigma), linewidth = 1.5) +\n scale_colour_brewer(palette = \"Set1\")"
+ "section": "Claim: Classification is easier than regression",
+ "text": "Claim: Classification is easier than regression\n\nLet \\(\\hat{f}\\) be any estimate of \\(f_*\\)\nLet \\(\\widehat{g} (x) = \\begin{cases}0 & \\hat f(x) < 1/2\\\\1 & else\\end{cases}\\)\n\nProof by picture."
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#the-bandwidth",
- "href": "schedule/slides/11-kernel-smoothers.html#the-bandwidth",
+ "objectID": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-1",
+ "href": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-1",
"title": "UBC Stat406 2023W",
- "section": "The bandwidth",
- "text": "The bandwidth\n\nChoosing \\(\\sigma\\) is very important.\nThis “range” parameter is called the bandwidth.\nIt is way more important than which kernel you use.\nThe default kernel in ksmooth() is something called ‘Epanechnikov’:\n\n\nepan <- function(x) 3/4 * (1 - x^2) * (abs(x) < 1)\nggplot(data.frame(x = c(-2, 2)), aes(x)) + stat_function(fun = epan, colour = green, linewidth = 2)"
+ "section": "Claim: Classification is easier than regression",
+ "text": "Claim: Classification is easier than regression\n\n\nCode\nset.seed(12345)\nx <- 1:99 / 100\ny <- rbinom(99, 1, \n .25 + .5 * (x > .3 & x < .5) + \n .6 * (x > .7))\ndmat <- as.matrix(dist(x))\nksm <- function(sigma) {\n gg <- dnorm(dmat, sd = sigma) \n sweep(gg, 1, rowSums(gg), '/') %*% y\n}\nfstar <- ksm(.04)\ngg <- tibble(x = x, fstar = fstar, y = y) %>%\n ggplot(aes(x)) +\n geom_point(aes(y = y), color = blue) +\n geom_line(aes(y = fstar), color = orange, size = 2) +\n coord_cartesian(ylim = c(0,1), xlim = c(0,1)) +\n annotate(\"label\", x = .75, y = .65, label = \"f_star\", size = 5)\ngg"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#choosing-the-bandwidth",
- "href": "schedule/slides/11-kernel-smoothers.html#choosing-the-bandwidth",
+ "objectID": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-2",
+ "href": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-2",
"title": "UBC Stat406 2023W",
- "section": "Choosing the bandwidth",
- "text": "Choosing the bandwidth\nAs we have discussed, kernel smoothing (and KNN) are linear smoothers\n\\[\\widehat{\\mathbf{y}} = \\mathbf{S}\\mathbf{y}\\]\nThe degrees of freedom is \\(\\textrm{tr}(\\mathbf{S})\\)\nTherefore we can use our model selection criteria from before\n\nUnfortunately, these don’t satisfy the “technical condition”, so cv_nice() doesn’t give LOO-CV"
+ "section": "Claim: Classification is easier than regression",
+ "text": "Claim: Classification is easier than regression\n\n\nCode\ngg + geom_hline(yintercept = .5, color = green)"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#smoothing-the-full-lidar-data",
- "href": "schedule/slides/11-kernel-smoothers.html#smoothing-the-full-lidar-data",
+ "objectID": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-3",
+ "href": "schedule/slides/14-classification-intro.html#claim-classification-is-easier-than-regression-3",
"title": "UBC Stat406 2023W",
- "section": "Smoothing the full Lidar data",
- "text": "Smoothing the full Lidar data\n\nar <- arcuate |> slice_sample(n = 200)\n\ngcv <- function(y, S) {\n yhat <- S %*% y\n mean( (y - yhat)^2 / (1 - mean(diag(S)))^2 )\n}\n\nfake_loocv <- function(y, S) {\n yhat <- S %*% y\n mean( (y - yhat)^2 / (1 - diag(S))^2 )\n}\n\ndmat <- as.matrix(dist(ar$position))\nsigmas <- 10^(seq(log10(300), log10(.3), length = 100))\n\ngcvs <- map_dbl(sigmas, ~ gcv(ar$fa, Sgauss(.x)))\nflcvs <- map_dbl(sigmas, ~ fake_loocv(ar$fa, Sgauss(.x)))\nbest_s <- sigmas[which.min(gcvs)]\nother_s <- sigmas[which.min(flcvs)]\n\nar$smoothed <- Sgauss(best_s) %*% ar$fa\nar$other <- Sgauss(other_s) %*% ar$fa"
+ "section": "Claim: Classification is easier than regression",
+ "text": "Claim: Classification is easier than regression\n\n\nCode\ntib <- tibble(x = x, fstar = fstar, y = y)\nggplot(tib) +\n geom_vline(data = filter(tib, fstar > 0.5), aes(xintercept = x), alpha = .5, color = green) +\n annotate(\"label\", x = .75, y = .65, label = \"f_star\", size = 5) + \n geom_point(aes(x = x, y = y), color = blue) +\n geom_line(aes(x = x, y = fstar), color = orange, size = 2) +\n coord_cartesian(ylim = c(0,1), xlim = c(0,1))"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#smoothing-the-full-lidar-data-1",
- "href": "schedule/slides/11-kernel-smoothers.html#smoothing-the-full-lidar-data-1",
+ "objectID": "schedule/slides/14-classification-intro.html#how-to-find-a-classifier",
+ "href": "schedule/slides/14-classification-intro.html#how-to-find-a-classifier",
"title": "UBC Stat406 2023W",
- "section": "Smoothing the full Lidar data",
- "text": "Smoothing the full Lidar data\n\n\nCode\ng3 <- ggplot(data.frame(sigma = sigmas, gcv = gcvs), aes(sigma, gcv)) +\n geom_point(colour = blue) +\n geom_vline(xintercept = best_s, colour = red) +\n scale_x_log10() +\n xlab(sprintf(\"Sigma, best is sig = %.2f\", best_s))\ng4 <- ggplot(ar, aes(position, fa)) +\n geom_point(colour = blue) +\n geom_line(aes(y = smoothed), colour = orange, linewidth = 2)\nplot_grid(g3, g4, nrow = 1)\n\n\n\nI considered \\(\\sigma \\in [0.3,\\ 300]\\) and used \\(3.97\\).\nIt’s too wiggly, to my eye. Typical for GCV."
+ "section": "How to find a classifier",
+ "text": "How to find a classifier\nWhy did we go through that math?\nEach of these approaches suggests a way to find a classifier\n\nEmpirical risk minimization: Choose a set of classifiers \\(\\mathcal{G}\\) and find \\(g \\in \\mathcal{G}\\) that minimizes some estimate of \\(R_n(g)\\)\n\n\n(This can be quite challenging as, unlike in regression, the training error is nonconvex)\n\n\nDensity estimation: Estimate \\(\\pi\\) and \\(p_k\\)\nRegression: Find an estimate \\(\\hat{f}\\) of \\(f^*\\) and compare the predicted value to 1/2"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#smoothing-manually",
- "href": "schedule/slides/11-kernel-smoothers.html#smoothing-manually",
+ "objectID": "schedule/slides/14-classification-intro.html#section",
+ "href": "schedule/slides/14-classification-intro.html#section",
"title": "UBC Stat406 2023W",
- "section": "Smoothing manually",
- "text": "Smoothing manually\nI did Kernel Smoothing “manually”\n\nFor a fixed bandwidth\nCompute the smoothing matrix\nMake the predictions\nRepeat and compute GCV\n\nThe point is to “show how it works”. It’s also really easy."
+ "section": "",
+ "text": "Easiest classifier when \\(y\\in \\{0,\\ 1\\}\\):\n(stupidest version of the third case…)\n\nghat <- round(predict(lm(y ~ ., data = trainingdata)))\n\nThink about why this may not be very good. (At least 2 reasons I can think of.)"
},
{
- "objectID": "schedule/slides/11-kernel-smoothers.html#r-functions-packages",
- "href": "schedule/slides/11-kernel-smoothers.html#r-functions-packages",
+ "objectID": "schedule/slides/16-logistic-regression.html#meta-lecture",
+ "href": "schedule/slides/16-logistic-regression.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "R functions / packages",
- "text": "R functions / packages\nThere are a number of other ways to do this in R\n\nloess()\nksmooth()\nKernSmooth::locpoly()\nmgcv::gam()\nnp::npreg()\n\nThese have tricks and ways of doing CV and other things automatically.\n\nNote\n\nAll I needed was the distance matrix dist(x).\n\n\nGiven ANY distance function\n\n\nsay, \\(d(\\mathbf{x}_i, \\mathbf{x}_j) = \\Vert\\mathbf{x}_i - \\mathbf{x}_j\\Vert_2 + I(x_{i,3} = x_{j,3})\\)\n\n\nI can use these methods."
+ "section": "16 Logistic regression",
+ "text": "16 Logistic regression\nStat 406\nDaniel J. McDonald\nLast modified – 25 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/13-gams-trees.html#meta-lecture",
- "href": "schedule/slides/13-gams-trees.html#meta-lecture",
+ "objectID": "schedule/slides/16-logistic-regression.html#last-time",
+ "href": "schedule/slides/16-logistic-regression.html#last-time",
"title": "UBC Stat406 2023W",
- "section": "13 GAMs and Trees",
- "text": "13 GAMs and Trees\nStat 406\nDaniel J. McDonald\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Last time",
+ "text": "Last time\n\nWe showed that with two classes, the Bayes’ classifier is\n\n\\[g_*(X) = \\begin{cases}\n1 & \\textrm{ if } \\frac{p_1(X)}{p_0(X)} > \\frac{1-\\pi}{\\pi} \\\\\n0 & \\textrm{ otherwise}\n\\end{cases}\\]\nwhere \\(p_1(X) = Pr(X \\given Y=1)\\) and \\(p_0(X) = Pr(X \\given Y=0)\\)\n\nWe then looked at what happens if we assume \\(Pr(X \\given Y=y)\\) is Normally distributed.\n\nWe then used this distribution and the class prior \\(\\pi\\) to find the posterior \\(Pr(Y=1 \\given X=x)\\)."
},
{
- "objectID": "schedule/slides/13-gams-trees.html#gams",
- "href": "schedule/slides/13-gams-trees.html#gams",
+ "objectID": "schedule/slides/16-logistic-regression.html#direct-model",
+ "href": "schedule/slides/16-logistic-regression.html#direct-model",
"title": "UBC Stat406 2023W",
- "section": "GAMs",
- "text": "GAMs\nLast time we discussed smoothing in multiple dimensions.\nHere we introduce the concept of GAMs (Generalized Additive Models)\nThe basic idea is to imagine that the response is the sum of some functions of the predictors:\n\\[\\Expect{Y \\given X=x} = \\beta_0 + f_1(x_{1})+\\cdots+f_p(x_{p}).\\]\nNote that OLS is a GAM (take \\(f_j(x_{j})=\\beta_j x_{j}\\)):\n\\[\\Expect{Y \\given X=x} = \\beta_0 + \\beta_1 x_{1}+\\cdots+\\beta_p x_{p}.\\]"
+ "section": "Direct model",
+ "text": "Direct model\nInstead, let’s directly model the posterior\n\\[\n\\begin{aligned}\nPr(Y = 1 \\given X=x) & = \\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}} \\\\\nPr(Y = 0 | X=x) & = \\frac{1}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}=1-\\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}\n\\end{aligned}\n\\]\nThis is logistic regression."
},
{
- "objectID": "schedule/slides/13-gams-trees.html#gams-1",
- "href": "schedule/slides/13-gams-trees.html#gams-1",
+ "objectID": "schedule/slides/16-logistic-regression.html#why-this",
+ "href": "schedule/slides/16-logistic-regression.html#why-this",
"title": "UBC Stat406 2023W",
- "section": "Gams",
- "text": "Gams\nThese work by estimating each \\(f_i\\) using basis expansions in predictor \\(i\\)\nThe algorithm for fitting these things is called “backfitting” (very similar to the CD intuition for lasso):\n\nCenter \\(\\y\\) and \\(\\X\\).\nHold \\(f_k\\) for all \\(k\\neq j\\) fixed, and regress \\(\\X_j\\) on \\((\\y - \\widehat{\\y}_{-j})\\) using your favorite smoother.\nRepeat for \\(1\\leq j\\leq p\\).\nRepeat steps 2 and 3 until the estimated functions “stop moving” (iterate)\nReturn the results."
+ "section": "Why this?",
+ "text": "Why this?\n\\[Pr(Y = 1 \\given X=x) = \\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}\\]\n\nThere are lots of ways to map \\(\\R \\mapsto [0,1]\\).\nThe “logistic” function \\(z\\mapsto (1 + \\exp(-z))^{-1} = \\exp(z) / (1+\\exp(z)) =:h(z)\\) is nice.\nIt’s symmetric: \\(1 - h(z) = h(-z)\\)\nHas a nice derivative: \\(h'(z) = \\frac{\\exp(z)}{(1 + \\exp(z))^2} = h(z)(1-h(z))\\).\nIt’s the inverse of the “log-odds” (logit): \\(\\log(p / (1-p))\\)."
},
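The stated properties of \(h\) are easy to verify numerically; a quick sketch:

```r
# Numerically check the stated properties of the logistic function h.
h <- function(z) 1 / (1 + exp(-z))
z <- seq(-3, 3, by = 0.5)

all.equal(1 - h(z), h(-z))                         # symmetry: 1 - h(z) = h(-z)

dh_numeric <- (h(z + 1e-6) - h(z - 1e-6)) / 2e-6   # central difference
dh_formula <- h(z) * (1 - h(z))                    # claimed derivative
all.equal(dh_numeric, dh_formula, tolerance = 1e-6)

p <- h(z)
all.equal(log(p / (1 - p)), z)                     # h inverts the log-odds
```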
{
- "objectID": "schedule/slides/13-gams-trees.html#very-small-example",
- "href": "schedule/slides/13-gams-trees.html#very-small-example",
+ "objectID": "schedule/slides/16-logistic-regression.html#another-linear-classifier",
+ "href": "schedule/slides/16-logistic-regression.html#another-linear-classifier",
"title": "UBC Stat406 2023W",
- "section": "Very small example",
- "text": "Very small example\n\nlibrary(mgcv)\nset.seed(12345)\nn <- 500\nsimple <- tibble(\n x1 = runif(n, 0, 2*pi),\n x2 = runif(n),\n y = 5 + 2 * sin(x1) + 8 * sqrt(x2) + rnorm(n, sd = .25)\n)\n\npivot_longer(simple, -y, names_to = \"predictor\", values_to = \"x\") |>\n ggplot(aes(x, y)) +\n geom_point(col = blue) +\n facet_wrap(~predictor, scales = \"free_x\")"
+ "section": "Another linear classifier",
+ "text": "Another linear classifier\nLike LDA, logistic regression is a linear classifier\nThe logit (i.e.: log odds) transformation gives a linear decision boundary \\[\\log\\left( \\frac{\\P(Y = 1 \\given X=x)}{\\P(Y = 0 \\given X=x) } \\right) = \\beta_0 + \\beta^{\\top} x\\] The decision boundary is the hyperplane \\(\\{x : \\beta_0 + \\beta^{\\top} x = 0\\}\\)\nIf the log-odds are below 0, classify as 0, above 0 classify as a 1."
},
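With two predictors, the decision boundary can be drawn directly from the fitted coefficients. A sketch, assuming a fitted two-predictor logistic regression named `logit` (as in the baby example a few slides later); this is not the slides' plotting code.

```r
# Sketch: the linear decision boundary of a two-predictor logistic regression.
# Assumes `logit` is a fitted glm(y ~ x1 + x2, family = "binomial") object.
b <- coef(logit)  # (Intercept), x1, x2
# Solve beta0 + b1 * x1 + b2 * x2 = 0 for x2:
boundary <- function(x1) -(b[1] + b[2] * x1) / b[3]
curve(boundary, from = -2.5, to = 3, xlab = "x1", ylab = "x2")
```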
{
- "objectID": "schedule/slides/13-gams-trees.html#very-small-example-1",
- "href": "schedule/slides/13-gams-trees.html#very-small-example-1",
+ "objectID": "schedule/slides/16-logistic-regression.html#logistic-regression-is-also-easy-in-r",
+ "href": "schedule/slides/16-logistic-regression.html#logistic-regression-is-also-easy-in-r",
"title": "UBC Stat406 2023W",
- "section": "Very small example",
- "text": "Very small example\nSmooth each coordinate independently\n\nex_smooth <- gam(y ~ s(x1) + s(x2), data = simple)\n# s(z) means \"smooth\" z, uses spline basis for each with ridge penalty, GCV\nplot(ex_smooth, pages = 1, scale = 0, shade = TRUE, \n resid = TRUE, se = 2, las = 1)\n\nhead(coef(ex_smooth))\n\n(Intercept) s(x1).1 s(x1).2 s(x1).3 s(x1).4 s(x1).5 \n 10.2070490 -4.5764100 0.7117161 0.4548928 0.5535001 -0.2092996 \n\nex_smooth$gcv.ubre\n\n GCV.Cp \n0.06619721"
+ "section": "Logistic regression is also easy in R",
+ "text": "Logistic regression is also easy in R\n\nlogistic <- glm(y ~ ., dat, family = \"binomial\")\n\nOr we can use lasso or ridge regression or a GAM as before\n\nlasso_logit <- cv.glmnet(x, y, family = \"binomial\")\nridge_logit <- cv.glmnet(x, y, alpha = 0, family = \"binomial\")\ngam_logit <- gam(y ~ s(x), data = dat, family = \"binomial\")\n\n\n\nglm means generalized linear model"
},
{
- "objectID": "schedule/slides/13-gams-trees.html#wherefore-gams",
- "href": "schedule/slides/13-gams-trees.html#wherefore-gams",
+ "objectID": "schedule/slides/16-logistic-regression.html#baby-example-continued-from-last-time",
+ "href": "schedule/slides/16-logistic-regression.html#baby-example-continued-from-last-time",
"title": "UBC Stat406 2023W",
- "section": "Wherefore GAMs?",
- "text": "Wherefore GAMs?\nIf\n\\(\\Expect{Y \\given X=x} = \\beta_0 + f_1(x_{1})+\\cdots+f_p(x_{p}),\\)\nthen\n\\(\\textrm{MSE}(\\hat f) = \\frac{Cp}{n^{4/5}} + \\sigma^2.\\)\n\nExponent no longer depends on \\(p\\). Converges faster. (If the truth is additive.)\nYou could also use the same methods to include “some” interactions like\n\n\\[\\begin{aligned}&\\Expect{Y \\given X=x}\\\\ &= \\beta_0 + f_{12}(x_{1},\\ x_{2})+f_3(x_3)+\\cdots+f_p(x_{p}),\\end{aligned}\\]"
+ "section": "Baby example (continued from last time)",
+ "text": "Baby example (continued from last time)\n\ndat1 <- generate_lda_2d(100, Sigma = .5 * diag(2))\nlogit <- glm(y ~ ., dat1 |> mutate(y = y - 1), family = \"binomial\")\nsummary(logit)\n\n\nCall:\nglm(formula = y ~ ., family = \"binomial\", data = mutate(dat1, \n y = y - 1))\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -2.6649 0.6281 -4.243 2.21e-05 ***\nx1 2.5305 0.5995 4.221 2.43e-05 ***\nx2 1.6610 0.4365 3.805 0.000142 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 138.469 on 99 degrees of freedom\nResidual deviance: 68.681 on 97 degrees of freedom\nAIC: 74.681\n\nNumber of Fisher Scoring iterations: 6"
},
{
- "objectID": "schedule/slides/13-gams-trees.html#very-small-example-2",
- "href": "schedule/slides/13-gams-trees.html#very-small-example-2",
+ "objectID": "schedule/slides/16-logistic-regression.html#visualizing-the-classification-boundary",
+ "href": "schedule/slides/16-logistic-regression.html#visualizing-the-classification-boundary",
"title": "UBC Stat406 2023W",
- "section": "Very small example",
- "text": "Very small example\nSmooth two coordinates together\n\nex_smooth2 <- gam(y ~ s(x1, x2), data = simple)\nplot(ex_smooth2,\n scheme = 2, scale = 0, shade = TRUE,\n resid = TRUE, se = 2, las = 1\n)"
+ "section": "Visualizing the classification boundary",
+ "text": "Visualizing the classification boundary\n\n\nCode\ngr <- expand_grid(x1 = seq(-2.5, 3, length.out = 100), \n x2 = seq(-2.5, 3, length.out = 100))\npts <- predict(logit, gr)\ng0 <- ggplot(dat1, aes(x1, x2)) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = tibble(gr, disc = pts), aes(x1, x2, fill = disc)) +\n geom_point(aes(shape = as.factor(y)), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_steps2(n.breaks = 6, name = \"log odds\") \ng0"
},
{
- "objectID": "schedule/slides/13-gams-trees.html#regression-trees",
- "href": "schedule/slides/13-gams-trees.html#regression-trees",
+ "objectID": "schedule/slides/16-logistic-regression.html#calculation",
+ "href": "schedule/slides/16-logistic-regression.html#calculation",
"title": "UBC Stat406 2023W",
- "section": "Regression trees",
- "text": "Regression trees\nTrees involve stratifying or segmenting the predictor space into a number of simple regions.\nTrees are simple and useful for interpretation.\nBasic trees are not great at prediction.\nModern methods that use trees are much better (Module 4)"
+ "section": "Calculation",
+ "text": "Calculation\nWhile the R formula for logistic regression is straightforward, it’s not as easy to compute as OLS or LDA or QDA.\nLogistic regression for two classes simplifies to a likelihood:\nWrite \\(p_i(\\beta) = \\P(Y_i = 1 | X = x_i,\\beta)\\)\n\n\\(P(Y_i = y_i \\given X = x_i, \\beta) = p_i^{y_i}(1-p_i)^{1-y_i}\\) (…Bernoulli distribution)\n\\(P(\\mathbf{Y} \\given \\mathbf{X}, \\beta) = \\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\\)."
},
{
- "objectID": "schedule/slides/13-gams-trees.html#regression-trees-1",
- "href": "schedule/slides/13-gams-trees.html#regression-trees-1",
+ "objectID": "schedule/slides/16-logistic-regression.html#calculation-1",
+ "href": "schedule/slides/16-logistic-regression.html#calculation-1",
"title": "UBC Stat406 2023W",
- "section": "Regression trees",
- "text": "Regression trees\nRegression trees estimate piece-wise constant functions\nThe slabs are axis-parallel rectangles \\(R_1,\\ldots,R_K\\) based on \\(\\X\\)\nIn each region, we average the \\(y_i\\)’s: \\(\\hat\\mu_1,\\ldots,\\hat\\mu_k\\)\nMinimize \\(\\sum_{k=1}^K \\sum_{i=1}^n (y_i-\\mu_k)^2\\) over \\(R_k,\\mu_k\\) for \\(k\\in \\{1,\\ldots,K\\}\\)\n\nThis sounds more complicated than it is.\nThe minimization is performed greedily (like forward stepwise regression)."
+ "section": "Calculation",
+ "text": "Calculation\nWrite \\(p_i(\\beta) = \\P(Y_i = 1 | X = x_i,\\beta)\\)\n\\[\n\\begin{aligned}\n\\ell(\\beta)\n& = \\log \\left( \\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i} \\right)\\\\\n&=\\sum_{i=1}^n \\left( y_i\\log(p_i(\\beta)) + (1-y_i)\\log(1-p_i(\\beta))\\right) \\\\\n& =\n\\sum_{i=1}^n \\left( y_i\\log(e^{\\beta^{\\top}x_i}/(1+e^{\\beta^{\\top}x_i})) - (1-y_i)\\log(1+e^{\\beta^{\\top}x_i})\\right) \\\\\n& =\n\\sum_{i=1}^n \\left( y_i\\beta^{\\top}x_i -\\log(1 + e^{\\beta^{\\top} x_i})\\right)\n\\end{aligned}\n\\]\nThis gets optimized via Newton-Raphson updates and iteratively reweighed least squares."
},
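As a sanity check on the simplified log-likelihood (not the author's code), it can be handed to a generic optimizer and compared with glm(); a sketch with simulated data, where the intercept is folded into the design matrix as a column of ones:

```r
# Sketch: maximize the logistic log-likelihood directly and compare to glm().
set.seed(3)
n <- 250
x <- cbind(1, rnorm(n), rnorm(n))           # design matrix with intercept column
beta_true <- c(-1, 2, 0.5)
y <- rbinom(n, 1, 1 / (1 + exp(-x %*% beta_true)))

negloglik <- function(beta) {
  eta <- drop(x %*% beta)
  -sum(y * eta - log(1 + exp(eta)))         # negative of the simplified form above
}

fit_optim <- optim(rep(0, 3), negloglik, method = "BFGS")
fit_glm <- glm(y ~ x[, 2] + x[, 3], family = "binomial")

cbind(optim = fit_optim$par, glm = coef(fit_glm))  # should agree to several decimals
```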
{
- "objectID": "schedule/slides/13-gams-trees.html#mobility-data",
- "href": "schedule/slides/13-gams-trees.html#mobility-data",
+ "objectID": "schedule/slides/16-logistic-regression.html#irwls-for-logistic-regression-skip-for-now",
+ "href": "schedule/slides/16-logistic-regression.html#irwls-for-logistic-regression-skip-for-now",
"title": "UBC Stat406 2023W",
- "section": "Mobility data",
- "text": "Mobility data\n\nbigtree <- tree(Mobility ~ ., data = mob)\nsmalltree <- prune.tree(bigtree, k = .09)\ndraw.tree(smalltree, digits = 2)\n\n\nThis is called the dendrogram"
+ "section": "IRWLS for logistic regression (skip for now)",
+ "text": "IRWLS for logistic regression (skip for now)\n(This is preparation for Neural Networks.)\n\nlogit_irwls <- function(y, x, maxit = 100, tol = 1e-6) {\n p <- ncol(x)\n beta <- double(p) # initialize coefficients\n beta0 <- 0\n conv <- FALSE # hasn't converged\n iter <- 1 # first iteration\n while (!conv && (iter < maxit)) { # check loops\n iter <- iter + 1 # update first thing (so as not to forget)\n eta <- beta0 + x %*% beta\n mu <- 1 / (1 + exp(-eta))\n gp <- 1 / (mu * (1 - mu)) # inverse of derivative of logistic\n z <- eta + (y - mu) * gp # effective transformed response\n beta_new <- coef(lm(z ~ x, weights = 1 / gp)) # do Weighted Least Squares\n conv <- mean(abs(c(beta0, beta) - betaNew)) < tol # check if the betas are \"moving\"\n beta0 <- betaNew[1] # update betas\n beta <- betaNew[-1]\n }\n return(c(beta0, beta))\n}"
},
{
- "objectID": "schedule/slides/13-gams-trees.html#partition-view",
- "href": "schedule/slides/13-gams-trees.html#partition-view",
+ "objectID": "schedule/slides/16-logistic-regression.html#comparing-lda-and-logistic-regression",
+ "href": "schedule/slides/16-logistic-regression.html#comparing-lda-and-logistic-regression",
"title": "UBC Stat406 2023W",
- "section": "Partition view",
- "text": "Partition view\n\nmob$preds <- predict(smalltree)\npar(mfrow = c(1, 2), mar = c(5, 3, 0, 0))\ndraw.tree(smalltree, digits = 2)\ncols <- viridisLite::viridis(20, direction = -1)[cut(log(mob$Mobility), 20)]\nplot(mob$Black, mob$Commute,\n pch = 19, cex = .4, bty = \"n\", las = 1, col = cols,\n ylab = \"Commute time\", xlab = \"% Black\"\n)\npartition.tree(smalltree, add = TRUE, ordvars = c(\"Black\", \"Commute\"))\n\n\nWe predict all observations in a region with the same value.\n\\(\\bullet\\) The three regions correspond to the leaves of the tree."
+ "section": "Comparing LDA and Logistic regression",
+ "text": "Comparing LDA and Logistic regression\nBoth decision boundaries are linear in \\(x\\):\n\nLDA \\(\\longrightarrow \\alpha_0 + \\alpha_1^\\top x\\)\nLogit \\(\\longrightarrow \\beta_0 + \\beta_1^\\top x\\).\n\nBut the parameters are estimated differently."
},
{
- "objectID": "schedule/slides/13-gams-trees.html#section-1",
- "href": "schedule/slides/13-gams-trees.html#section-1",
+ "objectID": "schedule/slides/16-logistic-regression.html#comparing-lda-and-logistic-regression-1",
+ "href": "schedule/slides/16-logistic-regression.html#comparing-lda-and-logistic-regression-1",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "draw.tree(bigtree, digits = 2)\n\n\nTerminology\nWe call each split or end point a node. Each terminal node is referred to as a leaf.\nThe interior nodes lead to branches."
+ "section": "Comparing LDA and Logistic regression",
+ "text": "Comparing LDA and Logistic regression\nExamine the joint distribution of \\((X,\\ Y)\\) (not the posterior):\n\nLDA: \\(f(X_i,\\ Y_i) = \\underbrace{ f(X_i \\given Y_i)}_{\\textrm{Gaussian}}\\underbrace{ f(Y_i)}_{\\textrm{Bernoulli}}\\)\nLogistic Regression: \\(f(X_i,Y_i) = \\underbrace{ f(Y_i\\given X_i)}_{\\textrm{Logistic}}\\underbrace{ f(X_i)}_{\\textrm{Ignored}}\\)\nLDA estimates the joint, but Logistic estimates only the conditional (posterior) distribution. But this is really all we need.\nSo logistic requires fewer assumptions.\nBut if the two classes are perfectly separable, logistic crashes (and the MLE is undefined, too many solutions)\nLDA “works” even if the conditional isn’t normal, but works very poorly if any X is qualitative"
},
{
- "objectID": "schedule/slides/13-gams-trees.html#advantages-and-disadvantages-of-trees",
- "href": "schedule/slides/13-gams-trees.html#advantages-and-disadvantages-of-trees",
+ "objectID": "schedule/slides/16-logistic-regression.html#comparing-with-qda-2-classes",
+ "href": "schedule/slides/16-logistic-regression.html#comparing-with-qda-2-classes",
"title": "UBC Stat406 2023W",
- "section": "Advantages and disadvantages of trees",
- "text": "Advantages and disadvantages of trees\n🎉 Trees are very easy to explain (much easier than even linear regression).\n🎉 Some people believe that decision trees mirror human decision.\n🎉 Trees can easily be displayed graphically no matter the dimension of the data.\n🎉 Trees can easily handle qualitative predictors without the need to create dummy variables.\n💩 Trees aren’t very good at prediction.\n💩 Full trees badly overfit, so we “prune” them using CV\n\nWe’ll talk more about trees next module for Classification."
+ "section": "Comparing with QDA (2 classes)",
+ "text": "Comparing with QDA (2 classes)\n\nRecall: this gives a “quadratic” decision boundary (it’s a curve).\nIf we have \\(p\\) columns in \\(X\\)\n\nLogistic estimates \\(p+1\\) parameters\nLDA estimates \\(2p + p(p+1)/2 + 1\\)\nQDA estimates \\(2p + p(p+1) + 1\\)\n\nIf \\(p=50\\),\n\nLogistic: 51\nLDA: 1376\nQDA: 2651\n\nQDA doesn’t get used much: there are better nonlinear versions with way fewer parameters"
},
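The parameter counts above are quick to recompute for any \(p\); a one-line check:

```r
# Parameter counts for p predictors (reproducing the p = 50 numbers above).
p <- 50
c(logistic = p + 1,
  LDA = 2 * p + p * (p + 1) / 2 + 1,
  QDA = 2 * p + p * (p + 1) + 1)
#> logistic      LDA      QDA
#>       51     1376     2651
```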
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#meta-lecture",
- "href": "schedule/slides/15-LDA-and-QDA.html#meta-lecture",
+ "objectID": "schedule/slides/16-logistic-regression.html#bad-parameter-counting",
+ "href": "schedule/slides/16-logistic-regression.html#bad-parameter-counting",
"title": "UBC Stat406 2023W",
- "section": "15 LDA and QDA",
- "text": "15 LDA and QDA\nStat 406\nDaniel J. McDonald\nLast modified – 09 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Bad parameter counting",
+ "text": "Bad parameter counting\nI’ve motivated LDA as needing \\(\\Sigma\\), \\(\\pi\\) and \\(\\mu_0\\), \\(\\mu_1\\)\nIn fact, we don’t need all of this to get the decision boundary.\nSo the “degrees of freedom” is much lower if we only want the classes and not the probabilities.\nThe decision boundary only really depends on\n\n\\(\\Sigma^{-1}(\\mu_1-\\mu_0)\\)\n\\((\\mu_1+\\mu_0)\\),\nso appropriate algorithms estimate \\(<2p\\) parameters."
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#last-time",
- "href": "schedule/slides/15-LDA-and-QDA.html#last-time",
+ "objectID": "schedule/slides/16-logistic-regression.html#note-again",
+ "href": "schedule/slides/16-logistic-regression.html#note-again",
"title": "UBC Stat406 2023W",
- "section": "Last time",
- "text": "Last time\nWe showed that with two classes, the Bayes’ classifier is\n\\[g_*(X) = \\begin{cases}\n1 & \\textrm{ if } \\frac{p_1(X)}{p_0(X)} > \\frac{1-\\pi}{\\pi} \\\\\n0 & \\textrm{ otherwise}\n\\end{cases}\\]\nwhere \\(p_1(X) = Pr(X \\given Y=1)\\), \\(p_0(X) = Pr(X \\given Y=0)\\) and \\(\\pi = Pr(Y=1)\\)\n\nFor more than two classes.\n\\[g_*(X) =\n\\argmax_k \\frac{\\pi_k p_k(X)}{\\sum_k \\pi_k p_k(X)}\\]\nwhere \\(p_k(X) = Pr(X \\given Y=k)\\) and \\(\\pi_k = P(Y=k)\\)"
+ "section": "Note again:",
+ "text": "Note again:\nwhile logistic regression and LDA produce linear decision boundaries, they are not linear smoothers\nAIC/BIC/Cp work if you use the likelihood correctly and count degrees-of-freedom correctly\nMust people use either test set or CV"
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#estimating-these",
- "href": "schedule/slides/15-LDA-and-QDA.html#estimating-these",
+ "objectID": "schedule/slides/18-the-bootstrap.html#meta-lecture",
+ "href": "schedule/slides/18-the-bootstrap.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Estimating these",
- "text": "Estimating these\nLet’s make some assumptions:\n\n\\(Pr(X\\given Y=k) = \\mbox{N}(\\mu_k,\\Sigma_k)\\)\n\\(\\Sigma_k = \\Sigma_{k'} = \\Sigma\\)\n\n\nThis leads to Linear Discriminant Analysis (LDA), one of the oldest classifiers"
+ "section": "18 The bootstrap",
+ "text": "18 The bootstrap\nStat 406\nDaniel J. McDonald\nLast modified – 02 November 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\]"
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#lda",
- "href": "schedule/slides/15-LDA-and-QDA.html#lda",
+ "objectID": "schedule/slides/18-the-bootstrap.html#a-small-detour",
+ "href": "schedule/slides/18-the-bootstrap.html#a-small-detour",
"title": "UBC Stat406 2023W",
- "section": "LDA",
- "text": "LDA\n\nSplit your training data into \\(K\\) subsets based on \\(y_i=k\\).\nIn each subset, estimate the mean of \\(X\\): \\(\\widehat\\mu_k = \\overline{X}_k\\)\nEstimate the pooled variance: \\[\\widehat\\Sigma = \\frac{1}{n-K} \\sum_{k \\in \\mathcal{K}} \\sum_{i \\in k} (x_i - \\overline{X}_k) (x_i - \\overline{X}_k)^{\\top}\\]\nEstimate the class proportion: \\(\\widehat\\pi_k = n_k/n\\)"
+ "section": "A small detour…",
+ "text": "A small detour…"
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#lda-1",
- "href": "schedule/slides/15-LDA-and-QDA.html#lda-1",
+ "objectID": "schedule/slides/18-the-bootstrap.html#in-statistics",
+ "href": "schedule/slides/18-the-bootstrap.html#in-statistics",
"title": "UBC Stat406 2023W",
- "section": "LDA",
- "text": "LDA\nAssume just \\(K = 2\\) so \\(k \\in \\{0,\\ 1\\}\\)\nWe predict \\(\\widehat{y} = 1\\) if\n\\[\\widehat{p_1}(x) / \\widehat{p_0}(x) > \\widehat{\\pi_0} / \\widehat{\\pi_1}\\]\nPlug in the density estimates:\n\\[\\widehat{p_k}(x) = N(x - \\widehat{\\mu}_k,\\ \\widehat\\Sigma)\\]"
+ "section": "In statistics…",
+ "text": "In statistics…\nThe “bootstrap” works. And well.\nIt’s good for “second-level” analysis.\n\n“First-level” analyses are things like \\(\\hat\\beta\\), \\(\\hat \\y\\), an estimator of the center (a median), etc.\n“Second-level” are things like \\(\\Var{\\hat\\beta}\\), a confidence interval for \\(\\hat \\y\\), or a median, etc.\n\n\nYou usually get these “second-level” properties from “the sampling distribution of an estimator”"
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#lda-2",
- "href": "schedule/slides/15-LDA-and-QDA.html#lda-2",
+ "objectID": "schedule/slides/18-the-bootstrap.html#in-statistics-1",
+ "href": "schedule/slides/18-the-bootstrap.html#in-statistics-1",
"title": "UBC Stat406 2023W",
- "section": "LDA",
- "text": "LDA\nNow we take \\(\\log\\) and simplify \\((K=2)\\):\n\\[\n\\begin{aligned}\n&\\Rightarrow \\log(\\widehat{p_1}(x)\\times\\widehat{\\pi_1}) - \\log(\\widehat{p_0}(x)\\times\\widehat{\\pi_0})\n= \\cdots = \\cdots\\\\\n&= \\underbrace{\\left(x^\\top\\widehat\\Sigma^{-1}\\overline X_1-\\frac{1}{2}\\overline X_1^\\top \\widehat\\Sigma^{-1}\\overline X_1 + \\log \\widehat\\pi_1\\right)}_{\\delta_1(x)} - \\underbrace{\\left(x^\\top\\widehat\\Sigma^{-1}\\overline X_0-\\frac{1}{2}\\overline X_0^\\top \\widehat\\Sigma^{-1}\\overline X_0 + \\log \\widehat\\pi_0\\right)}_{\\delta_0(x)}\\\\\n&= \\delta_1(x) - \\delta_0(x)\n\\end{aligned}\n\\]\nIf \\(\\delta_1(x) > \\delta_0(x)\\), we set \\(\\widehat g(x)=1\\)"
+ "section": "In statistics…",
+ "text": "In statistics…\nThe “bootstrap” works. And well.\nIt’s good for “second-level” analysis.\n\n“First-level” analyses are things like \\(\\hat\\beta\\), \\(\\hat \\y\\), an estimator of the center (a median), etc.\n“Second-level” are things like \\(\\Var{\\hat\\beta}\\), a confidence interval for \\(\\hat \\y\\), or a median, etc.\n\n\nBut what if you don’t know the sampling distribution? Or you’re skeptical of the CLT argument?"
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#one-dimensional-intuition",
- "href": "schedule/slides/15-LDA-and-QDA.html#one-dimensional-intuition",
+ "objectID": "schedule/slides/18-the-bootstrap.html#in-statistics-2",
+ "href": "schedule/slides/18-the-bootstrap.html#in-statistics-2",
"title": "UBC Stat406 2023W",
- "section": "One dimensional intuition",
- "text": "One dimensional intuition\n\nset.seed(406406406)\nn <- 100\npi <- .6\nmu0 <- -1\nmu1 <- 2\nsigma <- 2\ntib <- tibble(\n y = rbinom(n, 1, pi),\n x = rnorm(n, mu0, sigma) * (y == 0) + rnorm(n, mu1, sigma) * (y == 1)\n)\n\n\n\nCode\ngg <- ggplot(tib, aes(x, y)) +\n geom_point(colour = blue) +\n stat_function(fun = ~ 6 * (1 - pi) * dnorm(.x, mu0, sigma), colour = orange) +\n stat_function(fun = ~ 6 * pi * dnorm(.x, mu1, sigma), colour = orange) +\n annotate(\"label\",\n x = c(-3, 4.5), y = c(.5, 2 / 3),\n label = c(\"(1-pi)*p[0](x)\", \"pi*p[1](x)\"), parse = TRUE\n )\ngg"
+ "section": "In statistics…",
+ "text": "In statistics…\nThe “bootstrap” works. And well.\nIt’s good for “second-level” analysis.\n\n“First-level” analyses are things like \\(\\hat\\beta\\), \\(\\hat \\y\\), an estimator of the center (a median), etc.\n“Second-level” are things like \\(\\Var{\\hat\\beta}\\), a confidence interval for \\(\\hat \\y\\), or a median, etc.\n\n\nSampling distributions\n\nIf \\(X_i\\) are iid Normal \\((0,\\sigma^2)\\), then \\(\\Var{\\overline{X}} = \\sigma^2 / n\\).\nIf \\(X_i\\) are iid and \\(n\\) is big, then \\(\\Var{\\overline{X}} \\approx \\Var{X_1} / n\\).\nIf \\(X_i\\) are iid Binomial \\((m, p)\\), then \\(\\Var{\\overline{X}} = mp(1-p) / n\\)"
},
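A small simulation sketch confirming the first of these facts (the iid Normal case):

```r
# Quick simulation check of Var(Xbar) = sigma^2 / n for iid Normal data.
set.seed(4)
n <- 50; sigma <- 2
xbars <- replicate(5000, mean(rnorm(n, 0, sigma)))
c(simulated = var(xbars), theory = sigma^2 / n)
```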
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#what-is-linear",
- "href": "schedule/slides/15-LDA-and-QDA.html#what-is-linear",
+ "objectID": "schedule/slides/18-the-bootstrap.html#example-of-unknown-sampling-distribution",
+ "href": "schedule/slides/18-the-bootstrap.html#example-of-unknown-sampling-distribution",
"title": "UBC Stat406 2023W",
- "section": "What is linear?",
- "text": "What is linear?\nLook closely at the equation for \\(\\delta_1(x)\\):\n\\[\\delta_1(x)=x^\\top\\widehat\\Sigma^{-1}\\overline X_1-\\frac{1}{2}\\overline X_1^\\top \\widehat\\Sigma^{-1}\\overline X_1 + \\log \\widehat\\pi_1\\]\nWe can write this as \\(\\delta_1(x) = x^\\top a_1 + b_1\\) with \\(a_1 = \\widehat\\Sigma^{-1}\\overline X_1\\) and \\(b_1=-\\frac{1}{2}\\overline X_1^\\top \\widehat\\Sigma^{-1}\\overline X_1 + \\log \\widehat\\pi_1\\).\nWe can do the same for \\(\\delta_0(x)\\) (in terms of \\(a_0\\) and \\(b_0\\))\nTherefore,\n\\[\\delta_1(x)-\\delta_0(x) = x^\\top(a_1-a_0) + (b_1-b_0)\\]\nThis is how we discriminate between the classes.\nWe just calculate \\((a_1 - a_0)\\) (a vector in \\(\\R^p\\)), and \\(b_1 - b_0\\) (a scalar)"
+ "section": "Example of unknown sampling distribution",
+ "text": "Example of unknown sampling distribution\nI estimate a LDA on some data.\nI get a new \\(x_0\\) and produce \\(\\widehat{Pr}(y_0 =1 \\given x_0)\\).\nCan I get a 95% confidence interval for \\(Pr(y_0=1 \\given x_0)\\)?\n\nThe bootstrap gives this to you."
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#baby-example",
- "href": "schedule/slides/15-LDA-and-QDA.html#baby-example",
+ "objectID": "schedule/slides/18-the-bootstrap.html#bootstrap-procedure",
+ "href": "schedule/slides/18-the-bootstrap.html#bootstrap-procedure",
"title": "UBC Stat406 2023W",
- "section": "Baby example",
- "text": "Baby example\n\n\n\nlibrary(mvtnorm)\nlibrary(MASS)\ngenerate_lda_2d <- function(\n n, p = c(.5, .5), \n mu = matrix(c(0, 0, 1, 1), 2),\n Sigma = diag(2)) {\n X <- rmvnorm(n, sigma = Sigma)\n tibble(\n y = which(rmultinom(n, 1, p) == 1, TRUE)[,1],\n x1 = X[, 1] + mu[1, y],\n x2 = X[, 2] + mu[2, y]\n )\n}\ndat1 <- generate_lda_2d(100, Sigma = .5 * diag(2))\nlda_fit <- lda(y ~ ., dat1)"
+ "section": "Bootstrap procedure",
+ "text": "Bootstrap procedure\n\nResample your training data w/ replacement.\nCalculate LDA on this sample.\nProduce a new prediction, call it \\(\\widehat{Pr}_b(y_0 =1 \\given x_0)\\).\nRepeat 1-3 \\(b = 1,\\ldots,B\\) times.\nCI: \\(\\left[2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(1-\\alpha/2),\\ 2\\widehat{Pr}(y_0 =1 \\given x_0) - \\widehat{F}_{boot}(\\alpha/2)\\right]\\)\n\n\n\\(\\hat{F}\\) is the “empirical” distribution of the bootstraps."
},
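A sketch of steps 1-5 in code for the LDA example above. This is not from the slides: the data frame `train`, its factor response `y` (with levels "0" and "1"), and the one-row data frame `x0` of new predictors are all hypothetical names.

```r
# Sketch of the bootstrap CI for an LDA predicted probability.
library(MASS)

# original point estimate of Pr(y0 = 1 | x0)
pr_hat <- predict(lda(y ~ ., data = train), x0)$posterior[, "1"]

B <- 1000
alpha <- 0.05
pr_boot <- replicate(B, {
  idx <- sample(nrow(train), replace = TRUE)   # 1. resample rows w/ replacement
  fit <- lda(y ~ ., data = train[idx, ])       # 2. refit LDA on the resample
  predict(fit, x0)$posterior[, "1"]            # 3. re-predict at x0
})                                             # 4. repeated B times

# 5. CI from the bootstrap quantiles (pivotal form)
CI <- 2 * pr_hat - quantile(pr_boot, c(1 - alpha / 2, alpha / 2))
```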
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#multiple-classes",
- "href": "schedule/slides/15-LDA-and-QDA.html#multiple-classes",
+ "objectID": "schedule/slides/18-the-bootstrap.html#empirical-distribution",
+ "href": "schedule/slides/18-the-bootstrap.html#empirical-distribution",
"title": "UBC Stat406 2023W",
- "section": "Multiple classes",
- "text": "Multiple classes\n\nmoreclasses <- generate_lda_2d(150, c(.2, .3, .5), matrix(c(0, 0, 1, 1, 1, 0), 2), .5 * diag(2))\nseparateclasses <- generate_lda_2d(150, c(.2, .3, .5), matrix(c(-1, -1, 2, 2, 2, -1), 2), .1 * diag(2))"
+ "section": "Empirical distribution",
+ "text": "Empirical distribution\n\n\nCode\nr <- rexp(50, 1 / 5)\nggplot(tibble(r = r), aes(r)) + \n stat_ecdf(colour = orange) +\n geom_vline(xintercept = quantile(r, probs = c(.05, .95))) +\n geom_hline(yintercept = c(.05, .95), linetype = \"dashed\") +\n annotate(\n \"label\", x = c(5, 12), y = c(.25, .75), \n label = c(\"hat(F)[boot](.05)\", \"hat(F)[boot](.95)\"), \n parse = TRUE\n )"
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#qda",
- "href": "schedule/slides/15-LDA-and-QDA.html#qda",
+ "objectID": "schedule/slides/18-the-bootstrap.html#very-basic-example",
+ "href": "schedule/slides/18-the-bootstrap.html#very-basic-example",
"title": "UBC Stat406 2023W",
- "section": "QDA",
- "text": "QDA\nJust like LDA, but \\(\\Sigma_k\\) is separate for each class.\nProduces Quadratic decision boundary.\nEverything else is the same.\n\nqda_fit <- qda(y ~ ., dat1)\nqda_3fit <- qda(y ~ ., moreclasses)"
+ "section": "Very basic example",
+ "text": "Very basic example\n\nLet \\(X_i\\sim \\textrm{Exponential}(1/5)\\). The pdf is \\(f(x) = \\frac{1}{5}e^{-x/5}\\)\nI know if I estimate the mean with \\(\\bar{X}\\), then by the CLT (if \\(n\\) is big),\n\n\\[\\frac{\\sqrt{n}(\\bar{X}-E[X])}{s} \\approx N(0, 1).\\]\n\nThis gives me a 95% confidence interval like \\[\\bar{X} \\pm 2s/\\sqrt{n}\\]\nBut I don’t want to estimate the mean, I want to estimate the median."
},
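For contrast with the bootstrap interval for the median computed on the next slides, the CLT interval for the mean takes one line; a sketch with toy data:

```r
# CLT-based 95% interval for the mean, as in the formula above (toy data).
set.seed(5)
n <- 500
x <- rexp(n, 1 / 5)
mean(x) + c(-2, 2) * sd(x) / sqrt(n)
```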
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#class-comparison",
- "href": "schedule/slides/15-LDA-and-QDA.html#class-comparison",
+ "objectID": "schedule/slides/18-the-bootstrap.html#section-1",
+ "href": "schedule/slides/18-the-bootstrap.html#section-1",
"title": "UBC Stat406 2023W",
- "section": "3 class comparison",
- "text": "3 class comparison"
+ "section": "",
+ "text": "Code\nggplot(data.frame(x = c(0, 12)), aes(x)) +\n stat_function(fun = function(x) dexp(x, 1 / 5), color = orange) +\n geom_vline(xintercept = 5, color = blue) + # mean\n geom_vline(xintercept = qexp(.5, 1 / 5), color = red) + # median\n annotate(\"label\",\n x = c(2.5, 5.5, 10), y = c(.15, .15, .05),\n label = c(\"median\", \"bar(x)\", \"pdf\"), parse = TRUE,\n color = c(red, blue, orange), size = 6\n )"
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#notes",
- "href": "schedule/slides/15-LDA-and-QDA.html#notes",
+ "objectID": "schedule/slides/18-the-bootstrap.html#now-what",
+ "href": "schedule/slides/18-the-bootstrap.html#now-what",
"title": "UBC Stat406 2023W",
- "section": "Notes",
- "text": "Notes\n\nLDA is a linear classifier. It is not a linear smoother.\n\nIt is derived from Bayes rule.\nAssume each class-conditional density in Gaussian\nIt assumes the classes have different mean vectors, but the same (common) covariance matrix.\nIt estimates densities and probabilities and “plugs in”\n\nQDA is not a linear classifier. It depends on quadratic functions of the data.\n\nIt is derived from Bayes rule.\nAssume each class-conditional density in Gaussian\nIt assumes the classes have different mean vectors and different covariance matrices.\nIt estimates densities and probabilities and “plugs in”"
+ "section": "Now what…",
+ "text": "Now what…\n\nI give you a sample of size 500, you give me the sample median.\nHow do you get a CI?\nYou can use the bootstrap!\n\n\nset.seed(406406406)\nx <- rexp(n, 1 / 5)\n(med <- median(x)) # sample median\n\n[1] 3.611615\n\nB <- 100\nalpha <- 0.05\nFhat <- map_dbl(1:B, ~ median(sample(x, replace = TRUE))) # repeat B times, \"empirical distribution\"\nCI <- 2 * med - quantile(Fhat, probs = c(1 - alpha / 2, alpha / 2))"
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#section",
- "href": "schedule/slides/15-LDA-and-QDA.html#section",
+ "objectID": "schedule/slides/18-the-bootstrap.html#section-2",
+ "href": "schedule/slides/18-the-bootstrap.html#section-2",
"title": "UBC Stat406 2023W",
"section": "",
- "text": "It is hard (maybe impossible) to come up with reasonable classifiers that are linear smoothers. Many “look” like a linear smoother, but then apply a nonlinear transformation."
+ "text": "Code\nggplot(data.frame(Fhat), aes(Fhat)) +\n geom_density(color = orange) +\n geom_vline(xintercept = CI, color = orange, linetype = 2) +\n geom_vline(xintercept = med, col = blue) +\n geom_vline(xintercept = qexp(.5, 1 / 5), col = red) +\n annotate(\"label\",\n x = c(3.15, 3.5, 3.75), y = c(.5, .5, 1),\n color = c(orange, red, blue),\n label = c(\"widehat(F)\", \"true~median\", \"widehat(median)\"),\n parse = TRUE\n ) +\n xlab(\"x\") +\n geom_rug(aes(2 * med - Fhat))"
},
{
- "objectID": "schedule/slides/15-LDA-and-QDA.html#naïve-bayes",
- "href": "schedule/slides/15-LDA-and-QDA.html#naïve-bayes",
+ "objectID": "schedule/slides/18-the-bootstrap.html#how-does-this-work",
+ "href": "schedule/slides/18-the-bootstrap.html#how-does-this-work",
"title": "UBC Stat406 2023W",
- "section": "Naïve Bayes",
- "text": "Naïve Bayes\nAssume that \\(Pr(X | Y = k) = Pr(X_1 | Y = k)\\cdots Pr(X_p | Y = k)\\).\nThat is, conditional on the class, the feature distribution is independent.\n\nIf we further assume that \\(Pr(X_j | Y = k)\\) is Gaussian,\nThis is the same as QDA but with \\(\\Sigma_k\\) Diagonal.\n\n\nDon’t have to assume Gaussian. Could do lots of stuff."
+ "section": "How does this work?",
+ "text": "How does this work?"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#meta-lecture",
- "href": "schedule/slides/17-nonlinear-classifiers.html#meta-lecture",
+ "objectID": "schedule/slides/18-the-bootstrap.html#approximations",
+ "href": "schedule/slides/18-the-bootstrap.html#approximations",
"title": "UBC Stat406 2023W",
- "section": "17 Nonlinear classifiers",
- "text": "17 Nonlinear classifiers\nStat 406\nDaniel J. McDonald\nLast modified – 30 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Approximations",
+ "text": "Approximations"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#last-time",
- "href": "schedule/slides/17-nonlinear-classifiers.html#last-time",
+ "objectID": "schedule/slides/18-the-bootstrap.html#slightly-harder-example",
+ "href": "schedule/slides/18-the-bootstrap.html#slightly-harder-example",
"title": "UBC Stat406 2023W",
- "section": "Last time",
- "text": "Last time\nWe reviewed logistic regression\n\\[\\begin{aligned}\nPr(Y = 1 \\given X=x) & = \\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}} \\\\\nPr(Y = 0 \\given X=x) & = \\frac{1}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}=1-\\frac{\\exp\\{\\beta_0 + \\beta^{\\top}x\\}}{1 + \\exp\\{\\beta_0 + \\beta^{\\top}x\\}}\\end{aligned}\\]"
+ "section": "Slightly harder example",
+ "text": "Slightly harder example\n\n\n\nggplot(fatcats, aes(Bwt, Hwt)) +\n geom_point(color = blue) +\n xlab(\"Cat body weight (Kg)\") +\n ylab(\"Cat heart weight (g)\")\n\n\n\n\n\n\n\n\n\n\n\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nsummary(cats.lm)\n\n\nCall:\nlm(formula = Hwt ~ 0 + Bwt, data = fatcats)\n\nResiduals:\n Min 1Q Median 3Q Max \n-11.2353 -0.7932 -0.1407 0.5968 11.1026 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \nBwt 3.95424 0.06294 62.83 <2e-16 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 2.089 on 143 degrees of freedom\nMultiple R-squared: 0.965, Adjusted R-squared: 0.9648 \nF-statistic: 3947 on 1 and 143 DF, p-value: < 2.2e-16\n\nconfint(cats.lm)\n\n 2.5 % 97.5 %\nBwt 3.829836 4.078652"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#make-it-nonlinear",
- "href": "schedule/slides/17-nonlinear-classifiers.html#make-it-nonlinear",
+ "objectID": "schedule/slides/18-the-bootstrap.html#when-we-fit-models-we-examine-diagnostics",
+ "href": "schedule/slides/18-the-bootstrap.html#when-we-fit-models-we-examine-diagnostics",
"title": "UBC Stat406 2023W",
- "section": "Make it nonlinear",
- "text": "Make it nonlinear\nWe can make LDA or logistic regression have non-linear decision boundaries by mapping the features to a higher dimension (just like with regular regression)\nSay:\nPolynomials\n\\((x_1, x_2) \\mapsto \\left(1,\\ x_1,\\ x_1^2,\\ x_2,\\ x_2^2,\\ x_1 x_2\\right)\\)\n\ndat1 <- generate_lda_2d(100, Sigma = .5 * diag(2)) |> mutate(y = as.factor(y))\nlogit_poly <- glm(y ~ x1 * x2 + I(x1^2) + I(x2^2), dat1, family = \"binomial\")\nlda_poly <- lda(y ~ x1 * x2 + I(x1^2) + I(x2^2), dat1)"
+ "section": "When we fit models, we examine diagnostics",
+ "text": "When we fit models, we examine diagnostics\n\n\n\nqqnorm(residuals(cats.lm), pch = 16, col = blue)\nqqline(residuals(cats.lm), col = orange, lwd = 2)\n\n\n\n\n\n\n\n\nThe tails are too fat. So I don’t believe that CI…\n\n\nWe bootstrap\n\nB <- 500\nalpha <- .05\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |>\n slice_sample(prop = 1, replace = TRUE)\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n\n 97.5% 2.5% \n3.826735 4.084322 \n\nconfint(cats.lm) # Original CI\n\n 2.5 % 97.5 %\nBwt 3.829836 4.078652"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#visualizing-the-classification-boundary",
- "href": "schedule/slides/17-nonlinear-classifiers.html#visualizing-the-classification-boundary",
+ "objectID": "schedule/slides/18-the-bootstrap.html#an-alternative",
+ "href": "schedule/slides/18-the-bootstrap.html#an-alternative",
"title": "UBC Stat406 2023W",
- "section": "Visualizing the classification boundary",
- "text": "Visualizing the classification boundary\n\n\nCode\nlibrary(cowplot)\ngr <- expand_grid(x1 = seq(-2.5, 3, length.out = 100), x2 = seq(-2.5, 3, length.out = 100))\npts_logit <- predict(logit_poly, gr)\npts_lda <- predict(lda_poly, gr)\ng0 <- ggplot(dat1, aes(x1, x2)) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = tibble(gr, disc = pts_logit), aes(x1, x2, fill = disc)) +\n geom_point(aes(shape = as.factor(y)), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_viridis_b(n.breaks = 6, alpha = .5, name = \"log odds\") +\n ggtitle(\"Polynomial logit\") +\n theme(legend.position = \"bottom\", legend.key.width = unit(1.5, \"cm\"))\ng1 <- ggplot(dat1, aes(x1, x2)) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = tibble(gr, disc = pts_lda$x), aes(x1, x2, fill = disc)) +\n geom_point(aes(shape = as.factor(y)), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_viridis_b(n.breaks = 6, alpha = .5, name = bquote(delta[1] - delta[0])) +\n ggtitle(\"Polynomial lda\") +\n theme(legend.position = \"bottom\", legend.key.width = unit(1.5, \"cm\"))\nplot_grid(g0, g1)\n\n\n\nA linear decision boundary in the higher-dimensional space corresponds to a non-linear decision boundary in low dimensions."
+ "section": "An alternative",
+ "text": "An alternative\n\nSo far, I didn’t use any information about the data-generating process.\nWe’ve done the non-parametric bootstrap\nThis is easiest, and most common for the methods in this module\n\n\nBut there’s another version\n\nYou could try a “parametric bootstrap”\nThis assumes knowledge about the DGP"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#trees-reforestation",
- "href": "schedule/slides/17-nonlinear-classifiers.html#trees-reforestation",
+ "objectID": "schedule/slides/18-the-bootstrap.html#same-data",
+ "href": "schedule/slides/18-the-bootstrap.html#same-data",
"title": "UBC Stat406 2023W",
- "section": "Trees (reforestation)",
- "text": "Trees (reforestation)\n\n\nWe saw regression trees last module\nClassification trees are\n\nMore natural\nSlightly different computationally\n\nEverything else is pretty much the same"
+ "section": "Same data",
+ "text": "Same data\n\n\nNon-parametric bootstrap\nSame as before\n\nB <- 500\nalpha <- .05\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |>\n slice_sample(prop = 1, replace = TRUE)\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # NP Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n\n 97.5% 2.5% \n3.832907 4.070232 \n\nconfint(cats.lm) # Original CI\n\n 2.5 % 97.5 %\nBwt 3.829836 4.078652\n\n\n\n\nParametric bootstrap\n\nAssume that the linear model is TRUE.\nThen, \\(\\texttt{Hwt}_i = \\widehat{\\beta}\\times \\texttt{Bwt}_i + \\widehat{e}_i\\), \\(\\widehat{e}_i \\approx \\epsilon_i\\)\nThe \\(\\epsilon_i\\) is random \\(\\longrightarrow\\) just resample \\(\\widehat{e}_i\\).\n\n\nB <- 500\nbhats <- double(B)\ncats.lm <- lm(Hwt ~ 0 + Bwt, data = fatcats)\nr <- residuals(cats.lm)\nbhats <- map_dbl(1:B, ~ {\n newcats <- fatcats |> mutate(\n Hwt = predict(cats.lm) +\n sample(r, n(), replace = TRUE)\n )\n coef(lm(Hwt ~ 0 + Bwt, data = newcats))\n})\n\n2 * coef(cats.lm) - # Parametric Bootstrap CI\n quantile(bhats, probs = c(1 - alpha / 2, alpha / 2))\n\n 97.5% 2.5% \n3.815162 4.065045"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#axis-parallel-splits",
- "href": "schedule/slides/17-nonlinear-classifiers.html#axis-parallel-splits",
+ "objectID": "schedule/slides/18-the-bootstrap.html#bootstrap-error-sources",
+ "href": "schedule/slides/18-the-bootstrap.html#bootstrap-error-sources",
"title": "UBC Stat406 2023W",
- "section": "Axis-parallel splits",
- "text": "Axis-parallel splits\nLike with regression trees, classification trees operate by greedily splitting the predictor space\n\n\n\nnames(bakeoff)\n\n [1] \"winners\" \n [2] \"series\" \n [3] \"age\" \n [4] \"occupation\" \n [5] \"hometown\" \n [6] \"percent_star\" \n [7] \"percent_technical_wins\" \n [8] \"percent_technical_bottom3\"\n [9] \"percent_technical_top3\" \n[10] \"technical_highest\" \n[11] \"technical_lowest\" \n[12] \"technical_median\" \n[13] \"judge1\" \n[14] \"judge2\" \n[15] \"viewers_7day\" \n[16] \"viewers_28day\" \n\n\n\nsmalltree <- tree(\n winners ~ technical_median + percent_star,\n data = bakeoff\n)\n\n\n\n\n\nCode\npar(mar = c(5, 5, 0, 0) + .1)\nplot(bakeoff$technical_median, bakeoff$percent_star,\n pch = c(\"-\", \"+\")[bakeoff$winners + 1], cex = 2, bty = \"n\", las = 1,\n ylab = \"% star baker\", xlab = \"times above median in technical\",\n col = orange, cex.axis = 2, cex.lab = 2\n)\npartition.tree(smalltree,\n add = TRUE, col = blue,\n ordvars = c(\"technical_median\", \"percent_star\")\n)"
+ "section": "Bootstrap error sources",
+ "text": "Bootstrap error sources\n\n\nSimulation error\n\nusing only \\(B\\) samples to estimate \\(F\\) with \\(\\hat{F}\\).\n\nStatistical error\n\nour data depended on a sample from the population. We don’t have the whole population so we make an error by using a sample (Note: this part is what always happens with data, and what the science of statistics analyzes.)\n\nSpecification error\n\nIf we use the parametric bootstrap, and our model is wrong, then we are overconfident."
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#when-do-trees-do-well",
- "href": "schedule/slides/17-nonlinear-classifiers.html#when-do-trees-do-well",
+ "objectID": "schedule/slides/18-the-bootstrap.html#types-of-intervals",
+ "href": "schedule/slides/18-the-bootstrap.html#types-of-intervals",
"title": "UBC Stat406 2023W",
- "section": "When do trees do well?",
- "text": "When do trees do well?\n\n\n\n\n\n2D example\nTop Row:\ntrue decision boundary is linear\n🍎 linear classifier\n👎 tree with axis-parallel splits\nBottom Row:\ntrue decision boundary is non-linear\n🤮 A linear classifier can’t capture the true decision boundary\n🍎 decision tree is successful."
+ "section": "Types of intervals",
+ "text": "Types of intervals\nLet \\(\\hat{\\theta}\\) be our sample statistic, \\(\\hat{\\Theta}\\) be the resamples\nOur interval is\n\\[\n[2\\hat{\\theta} - \\theta^*_{1-\\alpha/2},\\ 2\\hat{\\theta} - \\theta^*_{\\alpha/2}]\n\\]\nwhere \\(\\theta^*_q\\) is the \\(q\\) quantile of \\(\\hat{\\Theta}\\).\n\n\nCalled the “Pivotal Interval”\nHas the correct \\(1-\\alpha\\)% coverage under very mild conditions on \\(\\hat{\\theta}\\)"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#how-do-we-build-a-tree",
- "href": "schedule/slides/17-nonlinear-classifiers.html#how-do-we-build-a-tree",
+ "objectID": "schedule/slides/18-the-bootstrap.html#types-of-intervals-1",
+ "href": "schedule/slides/18-the-bootstrap.html#types-of-intervals-1",
"title": "UBC Stat406 2023W",
- "section": "How do we build a tree?",
- "text": "How do we build a tree?\n\nDivide the predictor space into \\(J\\) non-overlapping regions \\(R_1, \\ldots, R_J\\)\n\n\nthis is done via greedy, recursive binary splitting\n\n\nEvery observation that falls into a given region \\(R_j\\) is given the same prediction\n\n\ndetermined by majority (or plurality) vote in that region.\n\nImportant:\n\nTrees can only make rectangular regions that are aligned with the coordinate axis.\nThe fit is greedy, which means that after a split is made, all further decisions are conditional on that split."
+ "section": "Types of intervals",
+ "text": "Types of intervals\nLet \\(\\hat{\\theta}\\) be our sample statistic, \\(\\hat{\\Theta}\\) be the resamples\n\\[\n[\\hat{\\theta} - z_{\\alpha/2}\\hat{s},\\ \\hat{\\theta} + z_{\\alpha/2}\\hat{s}]\n\\]\nwhere \\(\\hat{s} = \\sqrt{\\Var{\\hat{\\Theta}}}\\)\n\n\nCalled the “Normal Interval”\nOnly works if the distribution of \\(\\hat{\\Theta}\\) is approximately Normal.\nUnlikely to work well\nDon’t do this"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#how-do-we-measure-quality-of-fit",
- "href": "schedule/slides/17-nonlinear-classifiers.html#how-do-we-measure-quality-of-fit",
+ "objectID": "schedule/slides/18-the-bootstrap.html#types-of-intervals-2",
+ "href": "schedule/slides/18-the-bootstrap.html#types-of-intervals-2",
"title": "UBC Stat406 2023W",
- "section": "How do we measure quality of fit?",
- "text": "How do we measure quality of fit?\nLet \\(p_{mk}\\) be the proportion of training observations in the \\(m^{th}\\) region that are from the \\(k^{th}\\) class.\n\n\n\n\n\n\n\nclassification error rate:\n\\(E = 1 - \\max_k (\\widehat{p}_{mk})\\)\n\n\nGini index:\n\\(G = \\sum_k \\widehat{p}_{mk}(1-\\widehat{p}_{mk})\\)\n\n\ncross-entropy:\n\\(D = -\\sum_k \\widehat{p}_{mk}\\log(\\widehat{p}_{mk})\\)\n\n\n\nBoth Gini and cross-entropy measure the purity of the classifier (small if all \\(p_{mk}\\) are near zero or 1).\nThese are preferred over the classification error rate.\nClassification error is hard to optimize.\nWe build a classifier by growing a tree that minimizes \\(G\\) or \\(D\\)."
+ "section": "Types of intervals",
+ "text": "Types of intervals\nLet \\(\\hat{\\theta}\\) be our sample statistic, \\(\\hat{\\Theta}\\) be the resamples\n\\[\n[\\theta^*_{\\alpha/2},\\ \\theta^*_{1-\\alpha/2}]\n\\]\nwhere \\(\\theta^*_q\\) is the \\(q\\) quantile of \\(\\hat{\\Theta}\\).\n\n\nCalled the “Percentile Interval”\nWorks if \\(\\exists\\) monotone \\(m\\) so that \\(m(\\hat\\Theta) \\sim N(m(\\theta), c^2)\\)\nBetter than the Normal Interval\nMore assumptions than the Pivotal Interval"
},
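To make the three interval types concrete, here is a minimal R sketch that computes each of them from a vector of bootstrap replicates. It assumes the objects bhats, alpha, and cats.lm from the "Same data" slide above; the names theta_hat, q, and s_hat are illustrative, and the normal interval is shown only for comparison (the slides advise against it).

theta_hat <- coef(cats.lm)                          # sample statistic
q <- quantile(bhats, c(alpha / 2, 1 - alpha / 2))   # bootstrap quantiles of the resamples
s_hat <- sd(bhats)                                  # bootstrap standard error

pivotal    <- 2 * theta_hat - rev(q)                # [2*theta_hat - q_(1-a/2), 2*theta_hat - q_(a/2)]
percentile <- q                                     # [q_(a/2), q_(1-a/2)]
normal     <- theta_hat + c(-1, 1) * qnorm(1 - alpha / 2) * s_hat  # not recommended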
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#pruning-the-tree",
- "href": "schedule/slides/17-nonlinear-classifiers.html#pruning-the-tree",
+ "objectID": "schedule/slides/18-the-bootstrap.html#more-details",
+ "href": "schedule/slides/18-the-bootstrap.html#more-details",
"title": "UBC Stat406 2023W",
- "section": "Pruning the tree",
- "text": "Pruning the tree\n\nCross-validation can be used to directly prune the tree,\nBut it is computationally expensive (combinatorial complexity).\nInstead, we use weakest link pruning, (Gini version)\n\n\\[\\sum_{m=1}^{|T|} \\sum_{k \\in R_m} \\widehat{p}_{mk}(1-\\widehat{p}_{mk}) + \\alpha |T|\\]\n\n\\(|T|\\) is the number of terminal nodes.\nEssentially, we are trading training fit (first term) with model complexity (second) term (compare to lasso).\nNow, cross-validation can be used to pick \\(\\alpha\\)."
+ "section": "More details",
+ "text": "More details\n\nSee “All of Statistics” by Larry Wasserman, Chapter 8.3\n\nThere’s a handout with the proofs on Canvas (under Modules)"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#advantages-and-disadvantages-of-trees-again",
- "href": "schedule/slides/17-nonlinear-classifiers.html#advantages-and-disadvantages-of-trees-again",
+ "objectID": "schedule/slides/20-boosting.html#meta-lecture",
+ "href": "schedule/slides/20-boosting.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Advantages and disadvantages of trees (again)",
- "text": "Advantages and disadvantages of trees (again)\n🎉 Trees are very easy to explain (much easier than even linear regression).\n🎉 Some people believe that decision trees mirror human decision.\n🎉 Trees can easily be displayed graphically no matter the dimension of the data.\n🎉 Trees can easily handle qualitative predictors without the need to create dummy variables.\n💩 Trees aren’t very good at prediction.\n💩 Trees are highly variable. Small changes in training data \\(\\Longrightarrow\\) big changes in the tree.\nTo fix these last two, we can try to grow many trees and average their performance.\n\nWe do this next module"
+ "section": "20 Boosting",
+ "text": "20 Boosting\nStat 406\nDaniel J. McDonald\nLast modified – 02 November 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\]"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#knn-classifiers",
- "href": "schedule/slides/17-nonlinear-classifiers.html#knn-classifiers",
+ "objectID": "schedule/slides/20-boosting.html#last-time",
+ "href": "schedule/slides/20-boosting.html#last-time",
"title": "UBC Stat406 2023W",
- "section": "KNN classifiers",
- "text": "KNN classifiers\n\nWe saw \\(k\\)-nearest neighbors in the last module.\n\n\nlibrary(class)\nknn3 <- knn(dat1[, -1], gr, dat1$y, k = 3)\n\n\n\nCode\ngr$nn03 <- knn3\nggplot(dat1, aes(x1, x2)) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = tibble(gr, disc = knn3), aes(x1, x2, fill = disc), alpha = .5) +\n geom_point(aes(shape = as.factor(y)), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_manual(values = c(orange, blue), labels = c(\"0\", \"1\")) +\n theme(\n legend.position = \"bottom\", legend.title = element_blank(),\n legend.key.width = unit(2, \"cm\")\n )"
+ "section": "Last time",
+ "text": "Last time\nWe learned about bagging, for averaging low-bias / high-variance estimators.\nToday, we examine it’s opposite: Boosting.\nBoosting also combines estimators, but it combines high-bias / low-variance estimators.\nBoosting has a number of flavours. And if you Google descriptions, most are wrong.\nFor a deep (and accurate) treatment, see [ESL] Chapter 10\n\nWe’ll discuss 2 flavours: AdaBoost and Gradient Boosting\nNeither requires a tree, but that’s the typical usage.\nBoosting needs a “weak learner”, so small trees (stumps) are natural."
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#choosing-k-is-very-important",
- "href": "schedule/slides/17-nonlinear-classifiers.html#choosing-k-is-very-important",
+ "objectID": "schedule/slides/20-boosting.html#adaboost-intuition-for-classification",
+ "href": "schedule/slides/20-boosting.html#adaboost-intuition-for-classification",
"title": "UBC Stat406 2023W",
- "section": "Choosing \\(k\\) is very important",
- "text": "Choosing \\(k\\) is very important\n\n\nCode\nset.seed(406406406)\nks <- c(1, 2, 5, 10, 20)\nnn <- map(ks, ~ as_tibble(knn(dat1[, -1], gr[, 1:2], dat1$y, .x)) |> \n set_names(sprintf(\"k = %02s\", .x))) |>\n list_cbind() |>\n bind_cols(gr)\npg <- pivot_longer(nn, starts_with(\"k =\"), names_to = \"k\", values_to = \"knn\")\n\nggplot(pg, aes(x1, x2)) +\n geom_raster(aes(fill = knn), alpha = .6) +\n facet_wrap(~ k) +\n scale_fill_manual(values = c(orange, green), labels = c(\"0\", \"1\")) +\n geom_point(data = dat1, mapping = aes(x1, x2, shape = as.factor(y)), size = 4) +\n theme_bw(base_size = 18) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n theme(\n legend.title = element_blank(),\n legend.key.height = unit(3, \"cm\")\n )\n\n\n\n\nHow should we choose \\(k\\)?\nScaling is also very important. “Nearness” is determined by distance, so better to standardize your data first.\nIf there are ties, break randomly. So even \\(k\\) is strange."
+ "section": "AdaBoost intuition (for classification)",
+ "text": "AdaBoost intuition (for classification)\nAt each iteration, we weight the observations.\nObservations that are currently misclassified, get higher weights.\nSo on the next iteration, we’ll try harder to correctly classify our mistakes.\nThe number of iterations must be chosen."
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#knn.cv-leave-one-out",
- "href": "schedule/slides/17-nonlinear-classifiers.html#knn.cv-leave-one-out",
+ "objectID": "schedule/slides/20-boosting.html#adaboost-freund-and-schapire-generic",
+ "href": "schedule/slides/20-boosting.html#adaboost-freund-and-schapire-generic",
"title": "UBC Stat406 2023W",
- "section": "knn.cv() (leave one out)",
- "text": "knn.cv() (leave one out)\n\nkmax <- 20\nerr <- map_dbl(1:kmax, ~ mean(knn.cv(dat1[, -1], dat1$y, k = .x) != dat1$y))\n\n\nI would use the largest (odd) k that is close to the minimum.\nThis produces simpler, smoother, decision boundaries."
+ "section": "AdaBoost (Freund and Schapire, generic)",
+ "text": "AdaBoost (Freund and Schapire, generic)\nLet \\(G(x, \\theta)\\) be any weak learner\n⛭ imagine a tree with one split: then \\(\\theta=\\) (feature, split point)\nAlgorithm (AdaBoost) 🛠️\n\nSet observation weights \\(w_i=1/n\\).\nUntil we quit ( \\(m<M\\) iterations )\n\nEstimate the classifier \\(G(x,\\theta_m)\\) using weights \\(w_i\\)\nCalculate it’s weighted error \\(\\textrm{err}_m = \\sum_{i=1}^n w_i I(y_i \\neq G(x_i, \\theta_m)) / \\sum w_i\\)\nSet \\(\\alpha_m = \\log((1-\\textrm{err}_m)/\\text{err}_m)\\)\nUpdate \\(w_i \\leftarrow w_i \\exp(\\alpha_m I(y_i \\neq G(x_i,\\theta_m)))\\)\n\nFinal classifier is \\(G(x) = \\textrm{sign}\\left( \\sum_{m=1}^M \\alpha_m G(x, \\theta_m)\\right)\\)"
},
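As a minimal sketch of this algorithm, the loop below uses depth-one rpart trees (stumps) as the weak learner. It assumes predictors in a data frame x and labels y coded in {-1, 1}; the function and object names are illustrative, not from the slides, and in practice you would use a package such as gbm with distribution = "adaboost" (as on the next slide).

library(rpart)

adaboost_sketch <- function(x, y, M = 100) {
  n <- length(y)
  w <- rep(1 / n, n)                       # start with equal observation weights
  dat <- data.frame(x, y = factor(y))      # y must be in {-1, 1}
  trees <- vector("list", M)
  alpha <- numeric(M)
  for (m in seq_len(M)) {
    trees[[m]] <- rpart(y ~ ., data = dat, weights = w,
                        control = rpart.control(maxdepth = 1))   # a stump
    ghat <- ifelse(predict(trees[[m]], dat, type = "class") == "1", 1, -1)
    miss <- as.numeric(ghat != y)
    err <- sum(w * miss) / sum(w)          # weighted error
    alpha[m] <- log((1 - err) / err)
    w <- w * exp(alpha[m] * miss)          # upweight the mistakes
  }
  list(trees = trees, alpha = alpha)
}

predict_adaboost <- function(fit, newx) {  # sign of the weighted vote
  votes <- Map(function(tr, a) a * ifelse(predict(tr, newx, type = "class") == "1", 1, -1),
               fit$trees, fit$alpha)
  sign(Reduce(`+`, votes))
}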
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#alternative-using-deviance-loss-i-think-this-is-right",
- "href": "schedule/slides/17-nonlinear-classifiers.html#alternative-using-deviance-loss-i-think-this-is-right",
+ "objectID": "schedule/slides/20-boosting.html#using-mobility-data-again",
+ "href": "schedule/slides/20-boosting.html#using-mobility-data-again",
"title": "UBC Stat406 2023W",
- "section": "Alternative (using deviance loss, I think this is right)",
- "text": "Alternative (using deviance loss, I think this is right)\n\n\nCode\ndev <- function(y, prob, prob_min = 1e-5) {\n y <- as.numeric(as.factor(y)) - 1 # 0/1 valued\n m <- mean(y)\n prob_max <- 1 - prob_min\n prob <- pmin(pmax(prob, prob_min), prob_max)\n lp <- (1 - y) * log(1 - prob) + y * log(prob)\n ly <- (1 - y) * log(1 - m) + y * log(m)\n 2 * (ly - lp)\n}\nknn.cv_probs <- function(train, cl, k = 1) {\n o <- knn.cv(train, cl, k = k, prob = TRUE)\n p <- attr(o, \"prob\")\n o <- as.numeric(as.factor(o)) - 1\n p[o == 0] <- 1 - p[o == 0]\n p\n}\ndev_err <- map_dbl(1:kmax, ~ mean(dev(dat1$y, knn.cv_probs(dat1[, -1], dat1$y, k = .x))))"
+ "section": "Using mobility data again",
+ "text": "Using mobility data again\n\n\nCode\nlibrary(kableExtra)\nlibrary(randomForest)\nmob <- Stat406::mobility |>\n mutate(mobile = as.factor(Mobility > .1)) |>\n select(-ID, -Name, -Mobility, -State) |>\n drop_na()\nn <- nrow(mob)\ntrainidx <- sample.int(n, floor(n * .75))\ntestidx <- setdiff(1:n, trainidx)\ntrain <- mob[trainidx, ]\ntest <- mob[testidx, ]\nrf <- randomForest(mobile ~ ., data = train)\nbag <- randomForest(mobile ~ ., data = train, mtry = ncol(mob) - 1)\npreds <- tibble(truth = test$mobile, rf = predict(rf, test), bag = predict(bag, test))\n\n\n\n\nlibrary(gbm)\ntrain_boost <- train |>\n mutate(mobile = as.integer(mobile) - 1)\n# needs {0, 1} responses\ntest_boost <- test |>\n mutate(mobile = as.integer(mobile) - 1)\nadab <- gbm(\n mobile ~ .,\n data = train_boost,\n n.trees = 500,\n distribution = \"adaboost\"\n)\npreds$adab <- as.numeric(\n predict(adab, test_boost) > 0\n)\npar(mar = c(5, 11, 0, 1))\ns <- summary(adab, las = 1)"
},
{
- "objectID": "schedule/slides/17-nonlinear-classifiers.html#final-version",
- "href": "schedule/slides/17-nonlinear-classifiers.html#final-version",
+ "objectID": "schedule/slides/20-boosting.html#forward-stagewise-additive-modeling-fsam-completely-generic",
+ "href": "schedule/slides/20-boosting.html#forward-stagewise-additive-modeling-fsam-completely-generic",
"title": "UBC Stat406 2023W",
- "section": "Final version",
- "text": "Final version\n\n\n\n\nCode\nkopt <- max(which(err == min(err)))\nkopt <- kopt + 1 * (kopt %% 2 == 0)\ngr$opt <- knn(dat1[, -1], gr[, 1:2], dat1$y, k = kopt)\ntt <- table(knn(dat1[, -1], dat1[, -1], dat1$y, k = kopt), dat1$y, dnn = c(\"predicted\", \"truth\"))\nggplot(dat1, aes(x1, x2)) +\n theme_bw(base_size = 24) +\n scale_shape_manual(values = c(\"0\", \"1\"), guide = \"none\") +\n geom_raster(data = gr, aes(x1, x2, fill = opt), alpha = .6) +\n geom_point(aes(shape = y), size = 4) +\n coord_cartesian(c(-2.5, 3), c(-2.5, 3)) +\n scale_fill_manual(values = c(orange, green), labels = c(\"0\", \"1\")) +\n theme(\n legend.position = \"bottom\", legend.title = element_blank(),\n legend.key.width = unit(2, \"cm\")\n )\n\n\n\n\n\n\n\n\n\n\n\n\nBest \\(k\\): 19\nMisclassification error: 0.17\nConfusion matrix:\n\n\n\n truth\npredicted 1 2\n 1 41 6\n 2 11 42"
+ "section": "Forward stagewise additive modeling (FSAM, completely generic)",
+ "text": "Forward stagewise additive modeling (FSAM, completely generic)\nAlgorithm 🛠️\n\nSet initial predictor \\(f_0(x)=0\\)\nUntil we quit ( \\(m<M\\) iterations )\n\nCompute \\((\\beta_m, \\theta_m) = \\argmin_{\\beta, \\theta} \\sum_{i=1}^n L\\left(y_i,\\ f_{m-1}(x_i) + \\beta G(x_i,\\ \\theta)\\right)\\)\nSet \\(f_m(x) = f_{m-1}(x) + \\beta_m G(x,\\ \\theta_m)\\)\n\nFinal classifier is \\(G(x, \\theta_M) = \\textrm{sign}\\left( f_M(x) \\right)\\)\n\nHere, \\(L\\) is a loss function that measures prediction accuracy\n\n\nIf (1) \\(L(y,\\ f(x))= \\exp(-y f(x))\\), (2) \\(G\\) is a classifier, and WLOG \\(y \\in \\{-1, 1\\}\\)\n\nFSAM is equivalent to AdaBoost. Proven 5 years later (Friedman, Hastie, and Tibshirani 2000)."
},
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#meta-lecture",
- "href": "schedule/slides/19-bagging-and-rf.html#meta-lecture",
+ "objectID": "schedule/slides/20-boosting.html#so-what",
+ "href": "schedule/slides/20-boosting.html#so-what",
"title": "UBC Stat406 2023W",
- "section": "19 Bagging and random forests",
- "text": "19 Bagging and random forests\nStat 406\nDaniel J. McDonald\nLast modified – 11 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "So what?",
+ "text": "So what?\nIt turns out that “exponential loss” \\(L(y,\\ f(x))= \\exp(-y f(x))\\) is not very robust.\nHere are some other loss functions for 2-class classification\n\n\nWant losses which penalize negative margin, but not positive margins.\nRobust means don’t over-penalize large negatives"
},
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#bagging",
- "href": "schedule/slides/19-bagging-and-rf.html#bagging",
+ "objectID": "schedule/slides/20-boosting.html#gradient-boosting",
+ "href": "schedule/slides/20-boosting.html#gradient-boosting",
"title": "UBC Stat406 2023W",
- "section": "Bagging",
- "text": "Bagging\nMany methods (trees, nonparametric smoothers) tend to have low bias but high variance.\nEspecially fully grown trees (that’s why we prune them)\n\nHigh-variance\n\nif we split the training data into two parts at random and fit a decision tree to each part, the results will be quite different.\n\nIn contrast, a low variance estimator\n\nwould yield similar results if applied to the two parts (consider \\(\\widehat{f} = 0\\)).\n\n\nBagging, short for bootstrap aggregation, is a general purpose procedure for reducing variance.\nWe’ll use it specifically in the context of trees, but it can be applied much more broadly."
+ "section": "Gradient boosting",
+ "text": "Gradient boosting\nIn the forward stagewise algorithm, we solved a minimization and then made an update:\n\\[f_m(x) = f_{m-1}(x) + \\beta_m G(x, \\theta_m)\\]\nFor most loss functions \\(L\\) / procedures \\(G\\) this optimization is difficult: \\[\\argmin_{\\beta, \\theta} \\sum_{i=1}^n L\\left(y_i,\\ f_{m-1}(x_i) + \\beta G(x_i, \\theta)\\right)\\]\n💡 Just take one gradient step toward the minimum 💡\n\\[f_m(x) = f_{m-1}(x) -\\gamma_m \\nabla L(y,f_{m-1}(x)) = f_{m-1}(x) +\\gamma_m \\left(-\\nabla L(y,f_{m-1}(x))\\right)\\]\nThis is called Gradient boosting\nNotice how similar the update steps look."
},
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#bagging-the-heuristic-motivation",
- "href": "schedule/slides/19-bagging-and-rf.html#bagging-the-heuristic-motivation",
+ "objectID": "schedule/slides/20-boosting.html#gradient-boosting-1",
+ "href": "schedule/slides/20-boosting.html#gradient-boosting-1",
"title": "UBC Stat406 2023W",
- "section": "Bagging: The heuristic motivation",
- "text": "Bagging: The heuristic motivation\nSuppose we have \\(n\\) uncorrelated observations \\(Z_1, \\ldots, Z_n\\), each with variance \\(\\sigma^2\\).\nWhat is the variance of\n\\[\\overline{Z} = \\frac{1}{n} \\sum_{i=1}^n Z_i\\ \\ \\ ?\\]\n\nSuppose we had \\(B\\) separate (uncorrelated) training sets, \\(1, \\ldots, B\\),\nWe can form \\(B\\) separate model fits, \\(\\widehat{f}^1(x), \\ldots, \\widehat{f}^B(x)\\), and then average them:\n\\[\\widehat{f}_{B}(x) = \\frac{1}{B} \\sum_{b=1}^B \\widehat{f}^b(x)\\]"
+ "section": "Gradient boosting",
+ "text": "Gradient boosting\n\\[f_m(x) = f_{m-1}(x) -\\gamma_m \\nabla L(y,f_{m-1}(x)) = f_{m-1}(x) +\\gamma_m \\left(-\\nabla L(y,f_{m-1}(x))\\right)\\]\nGradient boosting goes only part of the way toward the minimum at each \\(m\\).\nThis has two advantages:\n\nSince we’re not fitting \\(\\beta, \\theta\\) to the data as “hard”, the learner is weaker.\nThis procedure is computationally much simpler.\n\nSimpler because we only require the gradient at one value, don’t have to fully optimize."
},
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#bagging-the-bootstrap-part",
- "href": "schedule/slides/19-bagging-and-rf.html#bagging-the-bootstrap-part",
+ "objectID": "schedule/slides/20-boosting.html#gradient-boosting-algorithm",
+ "href": "schedule/slides/20-boosting.html#gradient-boosting-algorithm",
"title": "UBC Stat406 2023W",
- "section": "Bagging: The bootstrap part",
- "text": "Bagging: The bootstrap part\n\nThis isn’t practical\n\nwe don’t have many training sets.\n\n\nWe therefore turn to the bootstrap to simulate having many training sets.\nSuppose we have data \\(Z_1, \\ldots, Z_n\\)\n\nChoose some large number of samples, \\(B\\).\nFor each \\(b = 1,\\ldots,B\\), resample from \\(Z_1, \\ldots, Z_n\\), call it \\(\\widetilde{Z}_1, \\ldots, \\widetilde{Z}_n\\).\nCompute \\(\\widehat{f}^b = \\widehat{f}(\\widetilde{Z}_1, \\ldots, \\widetilde{Z}_n)\\).\n\n\\[\\widehat{f}_{\\textrm{bag}}(x) = \\frac{1}{B} \\sum_{b=1}^B \\widehat{f}^b(x)\\]\nThis process is known as Bagging"
+ "section": "Gradient boosting – Algorithm 🛠️",
+ "text": "Gradient boosting – Algorithm 🛠️\n\nSet initial predictor \\(f_0(x)=\\overline{\\y}\\)\nUntil we quit ( \\(m<M\\) iterations )\n\nCompute pseudo-residuals (what is the gradient of \\(L(y,f)=(y-f(x))^2\\)?) \\[r_i = -\\frac{\\partial L(y_i,f(x_i))}{\\partial f(x_i)}\\bigg|_{f(x_i)=f_{m-1}(x_i)}\\]\nEstimate weak learner, \\(G(x, \\theta_m)\\), with the training set \\(\\{r_i, x_i\\}\\).\nFind the step size \\(\\gamma_m = \\argmin_\\gamma \\sum_{i=1}^n L(y_i, f_{m-1}(x_i) + \\gamma G(x_i, \\theta_m))\\)\nSet \\(f_m(x) = f_{m-1}(x) + \\gamma_m G(x, \\theta_m)\\)\n\nFinal predictor is \\(f_M(x)\\)."
},
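A minimal R sketch of this loop for squared-error loss, where the pseudo-residuals in step 1 reduce to y - f(x). Small rpart trees play the role of G(x, theta), and a fixed step size stands in for the line search in step 3; all names are illustrative.

library(rpart)

grad_boost_sketch <- function(x, y, M = 200, gamma = 0.1, depth = 2) {
  f <- rep(mean(y), length(y))             # f_0(x) = ybar
  trees <- vector("list", M)
  for (m in seq_len(M)) {
    r <- y - f                             # pseudo-residuals: -dL/df for L = (y - f)^2 / 2
    dat <- data.frame(x, r = r)
    trees[[m]] <- rpart(r ~ ., data = dat,
                        control = rpart.control(maxdepth = depth))  # the weak learner
    f <- f + gamma * predict(trees[[m]], dat)   # fixed step instead of the line search
  }
  list(f0 = mean(y), trees = trees, gamma = gamma)
}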
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#bagging-trees",
- "href": "schedule/slides/19-bagging-and-rf.html#bagging-trees",
+ "objectID": "schedule/slides/20-boosting.html#gradient-boosting-modifications",
+ "href": "schedule/slides/20-boosting.html#gradient-boosting-modifications",
"title": "UBC Stat406 2023W",
- "section": "Bagging trees",
- "text": "Bagging trees\n\n\n\n\n\nThe procedure for trees is the following\n\nChoose a large number \\(B\\).\nFor each \\(b = 1,\\ldots, B\\), grow an unpruned tree on the \\(b^{th}\\) bootstrap draw from the data.\nAverage all these trees together."
+ "section": "Gradient boosting modifications",
+ "text": "Gradient boosting modifications\n\ngrad_boost <- gbm(mobile ~ ., data = train_boost, n.trees = 500, distribution = \"bernoulli\")\n\n\nTypically done with “small” trees, not stumps because of the gradient. You can specify the size. Usually 4-8 terminal nodes is recommended (more gives more interactions between predictors)\nUsually modify the gradient step to \\(f_m(x) = f_{m-1}(x) + \\gamma_m \\alpha G(x,\\theta_m)\\) with \\(0<\\alpha<1\\). Helps to keep from fitting too hard.\nOften combined with Bagging so that each step is fit using a bootstrap resample of the data. Gives us out-of-bag options.\nThere are many other extensions, notably XGBoost."
},
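For reference, these modifications map onto arguments of gbm(); a hypothetical call might look like the one below (the object name grad_boost_mod and the particular values are illustrative, not tuned, and library(gbm) is assumed to be loaded as on the earlier slide).

grad_boost_mod <- gbm(
  mobile ~ ., data = train_boost,
  n.trees = 500,
  distribution = "bernoulli",
  interaction.depth = 4,   # small trees rather than stumps
  shrinkage = 0.1,         # the 0 < alpha < 1 step-size modification
  bag.fraction = 0.5       # fit each step on a subsample, enabling out-of-bag estimates
)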
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#bagging-trees-1",
- "href": "schedule/slides/19-bagging-and-rf.html#bagging-trees-1",
+ "objectID": "schedule/slides/20-boosting.html#results-for-mobility",
+ "href": "schedule/slides/20-boosting.html#results-for-mobility",
"title": "UBC Stat406 2023W",
- "section": "Bagging trees",
- "text": "Bagging trees\n\n\n\n\n\nEach tree, since it is unpruned, will have\n\nlow / high variance\nlow / high bias\n\nTherefore averaging many trees results in an estimator that has\n\nlower / higher variance and\nlow / high bias."
+ "section": "Results for mobility",
+ "text": "Results for mobility\n\n\nCode\nlibrary(cowplot)\nboost_preds <- tibble(\n adaboost = predict(adab, test_boost),\n gbm = predict(grad_boost, test_boost),\n truth = test$mobile\n)\ng1 <- ggplot(boost_preds, aes(adaboost, gbm, color = as.factor(truth))) +\n geom_text(aes(label = as.integer(truth) - 1)) +\n geom_vline(xintercept = 0) +\n geom_hline(yintercept = 0) +\n xlab(\"adaboost margin\") +\n ylab(\"gbm margin\") +\n theme(legend.position = \"none\") +\n scale_color_manual(values = c(\"orange\", \"blue\")) +\n annotate(\"text\",\n x = -4, y = 5, color = red,\n label = paste(\n \"gbm error\\n\",\n round(with(boost_preds, mean((gbm > 0) != truth)), 2)\n )\n ) +\n annotate(\"text\",\n x = 4, y = -5, color = red,\n label = paste(\"adaboost error\\n\", round(with(boost_preds, mean((adaboost > 0) != truth)), 2))\n )\nboost_oob <- tibble(\n adaboost = adab$oobag.improve, gbm = grad_boost$oobag.improve,\n ntrees = 1:500\n)\ng2 <- boost_oob %>%\n pivot_longer(-ntrees, values_to = \"OOB_Error\") %>%\n ggplot(aes(x = ntrees, y = OOB_Error, color = name)) +\n geom_line() +\n scale_color_manual(values = c(orange, blue)) +\n theme(legend.title = element_blank())\nplot_grid(g1, g2, rel_widths = c(.4, .6))"
},
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#bagging-trees-variable-importance-measures",
- "href": "schedule/slides/19-bagging-and-rf.html#bagging-trees-variable-importance-measures",
+ "objectID": "schedule/slides/20-boosting.html#major-takeaways",
+ "href": "schedule/slides/20-boosting.html#major-takeaways",
"title": "UBC Stat406 2023W",
- "section": "Bagging trees: Variable importance measures",
- "text": "Bagging trees: Variable importance measures\nBagging can dramatically improve predictive performance of trees\nBut we sacrificed some interpretability.\nWe no longer have that nice diagram that shows the segmentation of the predictor space\n(more accurately, we have \\(B\\) of them).\nTo recover some information, we can do the following:\n\nFor each of the \\(b\\) trees and each of the \\(p\\) variables, we record the amount that the Gini index is reduced by the addition of that variable\nReport the average reduction over all \\(B\\) trees."
+ "section": "Major takeaways",
+ "text": "Major takeaways\n\nTwo flavours of Boosting\n\nAdaBoost (the original) and\ngradient boosting (easier and more computationally friendly)\n\nThe connection is “Forward stagewise additive modelling” (AdaBoost is a special case)\nThe connection reveals that AdaBoost “isn’t robust because it uses exponential loss” (squared error is even worse)\nGradient boosting is a computationally easier version of FSAM\nAll use weak learners (compare to Bagging)\nThink about the Bias-Variance implications\nYou can use these for regression or classification\nYou can do this with other weak learners besides trees."
},
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#random-forest",
- "href": "schedule/slides/19-bagging-and-rf.html#random-forest",
+ "objectID": "schedule/slides/22-nnets-estimation.html#meta-lecture",
+ "href": "schedule/slides/22-nnets-estimation.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Random Forest",
- "text": "Random Forest\nRandom Forest is an extension of Bagging, in which the bootstrap trees are decorrelated.\nRemember: \\(\\Var{\\overline{Z}} = \\frac{1}{n}\\Var{Z_1}\\) unless the \\(Z_i\\)’s are correlated\nSo Bagging may not reduce the variance that much because the training sets are correlated across trees.\n\nHow do we decorrelate?\nDraw a bootstrap sample and start to build a tree.\n\nBut\n\nBefore we split, we randomly pick\n\n\n\\(m\\) of the possible \\(p\\) predictors as candidates for the split."
+ "section": "22 Neural nets - estimation",
+ "text": "22 Neural nets - estimation\nStat 406\nDaniel J. McDonald\nLast modified – 12 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
},
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#decorrelating",
- "href": "schedule/slides/19-bagging-and-rf.html#decorrelating",
+ "objectID": "schedule/slides/22-nnets-estimation.html#neural-network-terms-again-t-hidden-layers-regression",
+ "href": "schedule/slides/22-nnets-estimation.html#neural-network-terms-again-t-hidden-layers-regression",
"title": "UBC Stat406 2023W",
- "section": "Decorrelating",
- "text": "Decorrelating\nA new sample of size \\(m\\) of the predictors is taken at each split.\nUsually, we use about \\(m = \\sqrt{p}\\)\nIn other words, at each split, we aren’t even allowed to consider the majority of possible predictors!"
+ "section": "Neural Network terms again (T hidden layers, regression)",
+ "text": "Neural Network terms again (T hidden layers, regression)\n\n\n\\[\n\\begin{aligned}\nA_{k}^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_{\\ell}^{(t)} &= g\\left(\\sum_{k=1}^{K_{t-1}} w^{(t)}_{\\ell,k} A_{k}^{(t-1)} \\right)\\\\\n\\hat{Y} &= z_m = \\sum_{\\ell=1}^{K_T} \\beta_{m,\\ell} A_{\\ell}^{(T)}\\ \\ (M = 1)\n\\end{aligned}\n\\]\n\n\\(B \\in \\R^{M\\times K_T}\\).\n\\(M=1\\) for regression\n\n\\(\\mathbf{W}_t \\in \\R^{K_2\\times K_1}\\) \\(t=1,\\ldots,T\\)"
},
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#what-is-going-on-here",
- "href": "schedule/slides/19-bagging-and-rf.html#what-is-going-on-here",
+ "objectID": "schedule/slides/22-nnets-estimation.html#training-neural-networks.-first-choices",
+ "href": "schedule/slides/22-nnets-estimation.html#training-neural-networks.-first-choices",
"title": "UBC Stat406 2023W",
- "section": "What is going on here?",
- "text": "What is going on here?\nSuppose there is 1 really strong predictor and many mediocre ones.\n\nThen each tree will have this one predictor in it,\nTherefore, each tree will look very similar (i.e. highly correlated).\nAveraging highly correlated things leads to much less variance reduction than if they were uncorrelated.\n\nIf we don’t allow some trees/splits to use this important variable, each of the trees will be much less similar and hence much less correlated.\nBagging Trees is Random Forest when \\(m = p\\), that is, when we can consider all the variables at each split."
+ "section": "Training neural networks. First, choices",
+ "text": "Training neural networks. First, choices\n\nChoose the architecture: how many layers, units per layer, what connections?\nChoose the loss: common choices (for each data point \\(i\\))\n\n\nRegression\n\n\\(\\hat{R}_i = \\frac{1}{2}(y_i - \\hat{y}_i)^2\\) (the 1/2 just makes the derivative nice)\n\nClassification\n\n\\(\\hat{R}_i = I(y_i = m)\\log( 1 + \\exp(-z_{im}))\\)\n\n\n\nChoose the activation function \\(g\\)"
},
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#example-with-mobility-data",
- "href": "schedule/slides/19-bagging-and-rf.html#example-with-mobility-data",
+ "objectID": "schedule/slides/22-nnets-estimation.html#training-neural-networks-intuition",
+ "href": "schedule/slides/22-nnets-estimation.html#training-neural-networks-intuition",
"title": "UBC Stat406 2023W",
- "section": "Example with Mobility data",
- "text": "Example with Mobility data\n\nlibrary(randomForest)\nlibrary(kableExtra)\nset.seed(406406)\nmob <- Stat406::mobility |>\n mutate(mobile = as.factor(Mobility > .1)) |>\n select(-ID, -Name, -Mobility, -State) |>\n drop_na()\nn <- nrow(mob)\ntrainidx <- sample.int(n, floor(n * .75))\ntestidx <- setdiff(1:n, trainidx)\ntrain <- mob[trainidx, ]\ntest <- mob[testidx, ]\nrf <- randomForest(mobile ~ ., data = train)\nbag <- randomForest(mobile ~ ., data = train, mtry = ncol(mob) - 1)\npreds <- tibble(truth = test$mobile, rf = predict(rf, test), bag = predict(bag, test))\n\nkbl(cbind(table(preds$truth, preds$rf), table(preds$truth, preds$bag))) |>\n add_header_above(c(\"Truth\" = 1, \"RF\" = 2, \"Bagging\" = 2))\n\n\n\n\n\n\n\n\n\n\n\n\nTruth\n\n\nRF\n\n\nBagging\n\n\n\n\nFALSE\nTRUE\nFALSE\nTRUE\n\n\n\n\nFALSE\n61\n10\n60\n11\n\n\nTRUE\n12\n22\n10\n24"
+ "section": "Training neural networks (intuition)",
+ "text": "Training neural networks (intuition)\n\nWe need to estimate \\(B\\), \\(\\mathbf{W}_t\\), \\(t=1,\\ldots,T\\)\nWe want to minimize \\(\\hat{R} = \\sum_{i=1}^n \\hat{R}_i\\) as a function of all this.\nWe use gradient descent, but in this dialect, we call it back propagation\n\n\n\nDerivatives via the chain rule: computed by a forward and backward sweep\nAll the \\(g(u)\\)’s that get used have \\(g'(u)\\) “nice”.\nIf \\(g\\) is ReLu:\n\n\\(g(u) = xI(x>0)\\)\n\\(g'(u) = I(x>0)\\)\n\n\n\nOnce we have derivatives from backprop,\n\\[\n\\begin{align}\n\\widetilde{B} &\\leftarrow B - \\gamma \\frac{\\partial \\widehat{R}}{\\partial B}\\\\\n\\widetilde{\\mathbf{W}_t} &\\leftarrow \\mathbf{W}_t - \\gamma \\frac{\\partial \\widehat{R}}{\\partial \\mathbf{W}_t}\n\\end{align}\n\\]"
},
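As a tiny illustration, the ReLU pair above is one line each in R; the names g and gp are illustrative, and gp is written so that it preserves matrix shapes (the same pair is reused in the back-propagation sketch after the “Mapping it out” slide below).

g  <- function(u) u * (u > 0)    # ReLU: g(u) = u * I(u > 0)
gp <- function(u) (u > 0) * 1    # derivative: g'(u) = I(u > 0)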
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#example-with-mobility-data-1",
- "href": "schedule/slides/19-bagging-and-rf.html#example-with-mobility-data-1",
+ "objectID": "schedule/slides/22-nnets-estimation.html#chain-rule",
+ "href": "schedule/slides/22-nnets-estimation.html#chain-rule",
"title": "UBC Stat406 2023W",
- "section": "Example with Mobility data",
- "text": "Example with Mobility data\n\nvarImpPlot(rf, pch = 16, col = orange)"
+ "section": "Chain rule",
+ "text": "Chain rule\nWe want \\(\\frac{\\partial}{\\partial B} \\hat{R}_i\\) and \\(\\frac{\\partial}{\\partial W_{t}}\\hat{R}_i\\) for all \\(t\\).\nRegression: \\(\\hat{R}_i = \\frac{1}{2}(y_i - \\hat{y}_i)^2\\)\n\\[\\begin{aligned}\n\\frac{\\partial\\hat{R}_i}{\\partial B} &= -(y_i - \\hat{y}_i)\\frac{\\partial \\hat{y_i}}{\\partial B} =\\underbrace{-(y_i - \\hat{y}_i)}_{-r_i} \\mathbf{A}^{(T)}\\\\\n\\frac{\\partial}{\\partial \\mathbf{W}_T} \\hat{R}_i &= -(y_i - \\hat{y}_i)\\frac{\\partial\\hat{y_i}}{\\partial \\mathbf{W}_T} = -r_i \\frac{\\partial \\hat{y}_i}{\\partial \\mathbf{A}^{(T)}} \\frac{\\partial \\mathbf{A}^{(T)}}{\\partial \\mathbf{W}_T}\\\\\n&= -\\left(r_i B \\odot g'(\\mathbf{W}_T \\mathbf{A}^{(T)}) \\right) \\left(\\mathbf{A}^{(T-1)}\\right)^\\top\\\\\n\\frac{\\partial}{\\partial \\mathbf{W}_{T-1}} \\hat{R}_i &= -(y_i - \\hat{y}_i)\\frac{\\partial\\hat{y_i}}{\\partial \\mathbf{W}_{T-1}} = -r_i \\frac{\\partial \\hat{y}_i}{\\partial \\mathbf{A}^{(T)}} \\frac{\\partial \\mathbf{A}^{(T)}}{\\partial \\mathbf{W}_{T-1}}\\\\\n&= -r_i \\frac{\\partial \\hat{y}_i}{\\partial \\mathbf{A}^{(T)}} \\frac{\\partial \\mathbf{A}^{(T)}}{\\partial \\mathbf{W}_{T}}\\frac{\\partial \\mathbf{W}_{T}}{\\partial \\mathbf{A}^{(T-1)}}\\frac{\\partial \\mathbf{A}^{(T-1)}}{\\partial \\mathbf{W}_{T-1}}\\\\\n\\cdots &= \\cdots\n\\end{aligned}\\]"
},
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#one-last-thing",
- "href": "schedule/slides/19-bagging-and-rf.html#one-last-thing",
+ "objectID": "schedule/slides/22-nnets-estimation.html#mapping-it-out",
+ "href": "schedule/slides/22-nnets-estimation.html#mapping-it-out",
"title": "UBC Stat406 2023W",
- "section": "One last thing…",
- "text": "One last thing…\n\nOn average\n\ndrawing \\(n\\) samples from \\(n\\) observations with replacement (bootstrapping) results in ~ 2/3 of the observations being selected. (Can you show this?)\n\n\nThe remaining ~ 1/3 of the observations are not used on that tree.\nThese are referred to as out-of-bag (OOB).\nWe can think of it as a for-free cross-validation.\nEach time a tree is grown, we get its prediction error on the unused observations.\nWe average this over all bootstrap samples."
+ "section": "Mapping it out",
+ "text": "Mapping it out\nGiven current \\(\\mathbf{W}_t, B\\), we want to get new, \\(\\widetilde{\\mathbf{W}}_t,\\ \\widetilde B\\) for \\(t=1,\\ldots,T\\)\n\nSquared error for regression, cross-entropy for classification\n\n\n\nFeed forward \n\\[\\mathbf{A}^{(0)} = \\mathbf{X} \\in \\R^{n\\times p}\\]\nRepeat, \\(t= 1,\\ldots, T\\)\n\n\\(\\mathbf{Z}_{t} = \\mathbf{A}^{(t-1)}\\mathbf{W}_t \\in \\R^{n\\times K_t}\\)\n\\(\\mathbf{A}^{(t)} = g(\\mathbf{Z}_{t})\\) (component wise)\n\\(\\dot{\\mathbf{A}}^{(t)} = g'(\\mathbf{Z}_t)\\)\n\n\\[\\begin{cases}\n\\hat{\\mathbf{y}} =\\mathbf{A}^{(T)} B \\in \\R^n \\\\\n\\hat{\\Pi} = \\left(1 + \\exp\\left(-\\mathbf{A}^{(T)}\\mathbf{B}\\right)\\right)^{-1} \\in \\R^{n \\times M}\\end{cases}\\]\n\n\nBack propogate \n\\[r = \\begin{cases}\n-\\left(\\mathbf{y} - \\widehat{\\mathbf{y}}\\right) \\\\\n-\\left(1 - \\widehat{\\Pi}\\right)[y]\\end{cases}\\]\n\\[\n\\begin{aligned}\n\\frac{\\partial}{\\partial \\mathbf{B}} \\widehat{R} &= \\left(\\mathbf{A}^{(T)}\\right)^\\top \\mathbf{r}\\\\\n\\boldsymbol{\\Gamma} &\\leftarrow \\mathbf{r}\\\\\n\\mathbf{W}_{T+1} &\\leftarrow \\mathbf{B}\n\\end{aligned}\n\\]\nRepeat, \\(t = T,...,1\\),\n\n\\(\\boldsymbol{\\Gamma} \\leftarrow \\left(\\boldsymbol{\\Gamma} \\mathbf{W}_{t+1}\\right) \\odot\\dot{\\mathbf{A}}^{(t)}\\)\n\\(\\frac{\\partial R}{\\partial \\mathbf{W}_t} =\\left(\\mathbf{A}^{(t)}\\right)^\\top \\Gamma\\)"
},
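Here is a sketch of one full sweep for the regression case, following the shapes on this slide (A^(0) = X is n x p, Z_t = A^(t-1) W_t, yhat = A^(T) B), with transposes written out explicitly where the slide notation is loose. W is assumed to be a list of weight matrices, B a K_T x 1 matrix, and all names are illustrative.

g  <- function(u) u * (u > 0)               # ReLU, as on the previous slides
gp <- function(u) (u > 0) * 1

nn_gradients <- function(X, y, W, B) {
  TT <- length(W)
  A <- vector("list", TT + 1)
  Adot <- vector("list", TT)
  A[[1]] <- X                               # A^(0) = X
  for (t in seq_len(TT)) {                  # feed forward
    Z <- A[[t]] %*% W[[t]]                  # Z_t = A^(t-1) W_t
    A[[t + 1]] <- g(Z)
    Adot[[t]] <- gp(Z)
  }
  yhat <- drop(A[[TT + 1]] %*% B)
  r <- -(y - yhat)                          # squared-error case
  dB <- t(A[[TT + 1]]) %*% r                # dR/dB = (A^(T))' r
  Gamma <- matrix(r, ncol = 1)
  W_next <- B
  dW <- vector("list", TT)
  for (t in TT:1) {                         # back propagate
    Gamma <- (Gamma %*% t(W_next)) * Adot[[t]]   # (Gamma W_{t+1}') elementwise g'(Z_t)
    dW[[t]] <- t(A[[t]]) %*% Gamma          # uses A^(t-1), stored in A[[t]]
    W_next <- W[[t]]
  }
  list(dB = dB, dW = dW)
}

# one gradient-descent update with step size gamma, as on the previous slide:
# grads <- nn_gradients(X, y, W, B)
# B <- B - gamma * grads$dB
# W <- Map(function(Wt, dWt) Wt - gamma * dWt, W, grads$dW)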
{
- "objectID": "schedule/slides/19-bagging-and-rf.html#out-of-bag-error-estimation-for-bagging-rf",
- "href": "schedule/slides/19-bagging-and-rf.html#out-of-bag-error-estimation-for-bagging-rf",
+ "objectID": "schedule/slides/22-nnets-estimation.html#deep-nets",
+ "href": "schedule/slides/22-nnets-estimation.html#deep-nets",
"title": "UBC Stat406 2023W",
- "section": "Out-of-bag error estimation for bagging / RF",
- "text": "Out-of-bag error estimation for bagging / RF\nFor randomForest(), predict() without passing newdata = gives the OOB prediction\nnot like lm() where it gives the fitted values\n\ntab <- table(predict(bag), train$mobile) \nkbl(tab) |> add_header_above(c(\"Truth\" = 1, \"Bagging\" = 2))\n\n\n\n\n\n\n\n\n\n\nTruth\n\n\nBagging\n\n\n\n\nFALSE\nTRUE\n\n\n\n\nFALSE\n182\n28\n\n\nTRUE\n21\n82\n\n\n\n\n\n\n1 - sum(diag(tab)) / sum(tab) ## OOB misclassification error, no need for CV\n\n[1] 0.1565495"
+ "section": "Deep nets",
+ "text": "Deep nets\nSome comments on adding layers:\n\nIt has been shown that one hidden layer is sufficient to approximate any bounded piecewise continuous function\nHowever, this may take a huge number of hidden units (i.e. \\(K_1 \\gg 1\\)).\nThis is what people mean when they say that NNets are “universal approximators”\nBy including multiple layers, we can have fewer hidden units per layer.\nAlso, we can encode (in)dependencies that can speed computations\nWe don’t have to connect everything the way we have been"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#meta-lecture",
- "href": "schedule/slides/21-nnets-intro.html#meta-lecture",
+ "objectID": "schedule/slides/22-nnets-estimation.html#simple-example",
+ "href": "schedule/slides/22-nnets-estimation.html#simple-example",
"title": "UBC Stat406 2023W",
- "section": "21 Neural nets",
- "text": "21 Neural nets\nStat 406\nDaniel J. McDonald\nLast modified – 12 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "Simple example",
+ "text": "Simple example\n\nn <- 200\ndf <- tibble(\n x = seq(.05, 1, length = n),\n y = sin(1 / x) + rnorm(n, 0, .1) # Doppler function\n)\ntestdata <- matrix(seq(.05, 1, length.out = 1e3), ncol = 1)\nlibrary(neuralnet)\nnn_out <- neuralnet(y ~ x, data = df, hidden = c(10, 5, 15), threshold = 0.01, rep = 3)\nnn_preds <- map(1:3, ~ compute(nn_out, testdata, .x)$net.result)\nyhat <- nn_preds |> bind_cols() |> rowMeans() # average over the runs\n\n\n\nCode\n# This code will reproduce the analysis, takes some time\nset.seed(406406406)\nn <- 200\ndf <- tibble(\n x = seq(.05, 1, length = n),\n y = sin(1 / x) + rnorm(n, 0, .1) # Doppler function\n)\ntestx <- matrix(seq(.05, 1, length.out = 1e3), ncol = 1)\nlibrary(neuralnet)\nlibrary(splines)\nfstar <- sin(1 / testx)\nspline_test_err <- function(k) {\n fit <- lm(y ~ bs(x, df = k), data = df)\n yhat <- predict(fit, newdata = tibble(x = testx))\n mean((yhat - fstar)^2)\n}\nKs <- 1:15 * 10\nSplineErr <- map_dbl(Ks, ~ spline_test_err(.x))\n\nJgrid <- c(5, 10, 15)\nNNerr <- double(length(Jgrid)^3)\nNNplot <- character(length(Jgrid)^3)\nsweep <- 0\nfor (J1 in Jgrid) {\n for (J2 in Jgrid) {\n for (J3 in Jgrid) {\n sweep <- sweep + 1\n NNplot[sweep] <- paste(J1, J2, J3, sep = \" \")\n nn_out <- neuralnet(y ~ x, df,\n hidden = c(J1, J2, J3),\n threshold = 0.01, rep = 3\n )\n nn_results <- sapply(1:3, function(x) {\n compute(nn_out, testx, x)$net.result\n })\n # Run them through the neural network\n Yhat <- rowMeans(nn_results)\n NNerr[sweep] <- mean((Yhat - fstar)^2)\n }\n }\n}\n\nbestK <- Ks[which.min(SplineErr)]\nbestspline <- predict(lm(y ~ bs(x, bestK), data = df), newdata = tibble(x = testx))\nbesthidden <- as.numeric(unlist(strsplit(NNplot[which.min(NNerr)], \" \")))\nnn_out <- neuralnet(y ~ x, df, hidden = besthidden, threshold = 0.01, rep = 3)\nnn_results <- sapply(1:3, function(x) compute(nn_out, testdata, x)$net.result)\n# Run them through the neural network\nbestnn <- rowMeans(nn_results)\nplotd <- data.frame(\n x = testdata, spline = bestspline, nnet = bestnn, truth = fstar\n)\nsave.image(file = \"data/nnet-example.Rdata\")"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#overview",
- "href": "schedule/slides/21-nnets-intro.html#overview",
+ "objectID": "schedule/slides/22-nnets-estimation.html#different-architectures",
+ "href": "schedule/slides/22-nnets-estimation.html#different-architectures",
"title": "UBC Stat406 2023W",
- "section": "Overview",
- "text": "Overview\nNeural networks are models for supervised learning\nLinear combinations of features are passed through a non-linear transformation in successive layers\nAt the top layer, the resulting latent factors are fed into an algorithm for predictions\n(Most commonly via least squares or logistic loss)"
+ "section": "Different architectures",
+ "text": "Different architectures"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#background",
- "href": "schedule/slides/21-nnets-intro.html#background",
+ "objectID": "schedule/slides/24-pca-intro.html#meta-lecture",
+ "href": "schedule/slides/24-pca-intro.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Background",
- "text": "Background\n\n\nNeural networks have come about in 3 “waves”\nThe first was an attempt in the 1950s to model the mechanics of the human brain\nIt appeared the brain worked by\n\ntaking atomic units known as neurons, which can be “on” or “off”\nputting them in networks\n\nA neuron itself interprets the status of other neurons\nThere weren’t really computers, so we couldn’t estimate these things"
+ "section": "24 Principal components, introduction",
+ "text": "24 Principal components, introduction\nStat 406\nDaniel J. McDonald\nLast modified – 01 November 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\]"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#background-1",
- "href": "schedule/slides/21-nnets-intro.html#background-1",
+ "objectID": "schedule/slides/24-pca-intro.html#unsupervised-learning",
+ "href": "schedule/slides/24-pca-intro.html#unsupervised-learning",
"title": "UBC Stat406 2023W",
- "section": "Background",
- "text": "Background\nAfter the development of parallel, distributed computation in the 1980s, this “artificial intelligence” view was diminished\nAnd neural networks gained popularity\nBut, the growing popularity of SVMs and boosting/bagging in the late 1990s, neural networks again fell out of favor\nThis was due to many of the problems we’ll discuss (non-convexity being the main one)\n\nIn the mid 2000’s, new approaches for initializing neural networks became available\nThese approaches are collectively known as deep learning\nState-of-the-art performance on various classification tasks has been accomplished via neural networks\nToday, Neural Networks/Deep Learning are the hottest…"
+ "section": "Unsupervised learning",
+ "text": "Unsupervised learning\nIn Machine Learning, rather than calling \\(\\y\\) the response, people call it the supervisor\nSo unsupervised learning means learning without \\(\\y\\)\nThe only data you get are the features \\(\\{x_1,\\ldots,x_n\\}\\).\nThis type of analysis is more often exploratory\nWe’re not necessarily using this for prediction (but we could)\nSo now, we get \\(\\X\\)\nThe two main activities are representation learning and clustering"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#high-level-overview",
- "href": "schedule/slides/21-nnets-intro.html#high-level-overview",
+ "objectID": "schedule/slides/24-pca-intro.html#representation-learning",
+ "href": "schedule/slides/24-pca-intro.html#representation-learning",
"title": "UBC Stat406 2023W",
- "section": "High level overview",
- "text": "High level overview"
+ "section": "Representation learning",
+ "text": "Representation learning\nRepresentation learning is the idea that performance of ML methods is highly dependent on the choice of representation\nFor this reason, much of ML is geared towards transforming the data into the relevant features and then using these as inputs\nThis idea is as old as statistics itself, really,\nHowever, the idea is constantly revisited in a variety of fields and contexts\nCommonly, these learned representations capture low-level information like overall shapes\nIt is possible to quantify this intuition for PCA at least\n\n\nGoal\n\nTransform \\(\\mathbf{X}\\in \\R^{n\\times p}\\) into \\(\\mathbf{Z} \\in \\R^{n \\times ?}\\)\n\n\n?-dimension can be bigger (feature creation) or smaller (dimension reduction) than \\(p\\)"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#recall-nonparametric-regression",
- "href": "schedule/slides/21-nnets-intro.html#recall-nonparametric-regression",
+ "objectID": "schedule/slides/24-pca-intro.html#youve-done-this-already",
+ "href": "schedule/slides/24-pca-intro.html#youve-done-this-already",
"title": "UBC Stat406 2023W",
- "section": "Recall nonparametric regression",
- "text": "Recall nonparametric regression\nSuppose \\(Y \\in \\mathbb{R}\\) and we are trying estimate the regression function \\[\\Expect{Y\\given X} = f_*(X)\\]\nIn Module 2, we discussed basis expansion,\n\nWe know \\(f_*(x) =\\sum_{k=1}^\\infty \\beta_k h_k(x)\\) some basis \\(h_1,h_2,\\ldots\\) (using \\(h\\) instead of \\(\\phi\\) to match ISLR)\nTruncate this expansion at \\(K\\): \\(f_*^K(x) \\approx \\sum_{k=1}^K \\beta_k h_k(x)\\)\nEstimate \\(\\beta_k\\) with least squares"
+ "section": "You’ve done this already!",
+ "text": "You’ve done this already!\n\nYou added transformations as predictors in regression\nYou “expanded” \\(\\mathbf{X}\\) using a basis \\(\\Phi\\) (polynomials, splines, etc.)\nYou used Neural Nets to do a “feature map”\n\n\nThis is the same, just no \\(Y\\) around"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#recall-nonparametric-regression-1",
- "href": "schedule/slides/21-nnets-intro.html#recall-nonparametric-regression-1",
+ "objectID": "schedule/slides/24-pca-intro.html#pca",
+ "href": "schedule/slides/24-pca-intro.html#pca",
"title": "UBC Stat406 2023W",
- "section": "Recall nonparametric regression",
- "text": "Recall nonparametric regression\nThe weaknesses of this approach are:\n\nThe basis is fixed and independent of the data\nIf \\(p\\) is large, then nonparametrics doesn’t work well at all (recall the Curse of Dimensionality)\nIf the basis doesn’t “agree” with \\(f_*\\), then \\(K\\) will have to be large to capture the structure\nWhat if parts of \\(f_*\\) have substantially different structure? Say \\(f_*(x)\\) really wiggly for \\(x \\in [-1,3]\\) but smooth elsewhere\n\nAn alternative would be to have the data tell us what kind of basis to use (Module 5)"
+ "section": "PCA",
+ "text": "PCA\nPrincipal components analysis (PCA) is an (unsupervised) dimension reduction technique\nIt solves various equivalent optimization problems\n(Maximize variance, minimize \\(\\ell_2\\) distortions, find closest subspace of a given rank, \\(\\ldots\\))\nAt its core, we are finding linear combinations of the original (centered) covariates \\[z_{ij} = \\alpha_j^{\\top} x_i\\]\nThis is expressed via the SVD: \\(\\X = \\U\\D\\V^{\\top}\\).\n\n\n\n\n\n\n\nImportant\n\n\nWe assume throughout that \\(\\X - \\mathbf{11^\\top}\\overline{x} = 0\\) (we center the columns)\n\n\n\nThen our new features are\n\\[\\mathbf{Z} = \\X \\V = \\U\\D\\]"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#layer-for-regression",
- "href": "schedule/slides/21-nnets-intro.html#layer-for-regression",
+ "objectID": "schedule/slides/24-pca-intro.html#short-svd-aside-reminder-from-ridge-regression",
+ "href": "schedule/slides/24-pca-intro.html#short-svd-aside-reminder-from-ridge-regression",
"title": "UBC Stat406 2023W",
- "section": "1-layer for Regression",
- "text": "1-layer for Regression\n\n\nA single layer neural network model is \\[\n\\begin{aligned}\n&f(x) = \\beta_0 + \\sum_{k=1}^K \\beta_k h_k(x) \\\\\n&= \\beta_0 + \\sum_{k=1}^K \\beta_k \\ g(w_{k0} + w_k^{\\top}x)\\\\\n&= \\beta_0 + \\sum_{k=1}^K \\beta_k \\ A_k\\\\\n\\end{aligned}\n\\]\nCompare: A nonparametric regression \\[f(x) = \\beta_0 + \\sum_{k=1}^K \\beta_k {\\phi_k(x)}\\]"
+ "section": "Short SVD aside (reminder from Ridge Regression)",
+ "text": "Short SVD aside (reminder from Ridge Regression)\n\nAny \\(n\\times p\\) matrix can be decomposed into \\(\\mathbf{UDV}^\\top\\).\nThese have properties:\n\n\n\\(\\mathbf{U}^\\top \\mathbf{U} = \\mathbf{I}_n\\)\n\\(\\mathbf{V}^\\top \\mathbf{V} = \\mathbf{I}_p\\)\n\\(\\mathbf{D}\\) is diagonal (0 off the diagonal)\n\nAlmost all the methods for we’ll talk about for representation learning use the SVD of some matrix."
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#terminology",
- "href": "schedule/slides/21-nnets-intro.html#terminology",
+ "objectID": "schedule/slides/24-pca-intro.html#why",
+ "href": "schedule/slides/24-pca-intro.html#why",
"title": "UBC Stat406 2023W",
- "section": "Terminology",
- "text": "Terminology\n\\[f(x) = {\\beta_0} + \\sum_{k=1}^{{K}} {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}\\] The main components are\n\nThe derived features \\({A_k = g(w_{k0} + w_k^{\\top}x)}\\) and are called the hidden units or activations\nThe function \\(g\\) is called the activation function (more on this later)\nThe parameters \\({\\beta_0},{\\beta_k},{w_{k0}},{w_k}\\) are estimated from the data for all \\(k = 1,\\ldots, K\\).\nThe number of hidden units \\({K}\\) is a tuning parameter\n\\(\\beta_0\\) and \\(w_{k0}\\) are usually called biases (I’m going to set them to 0 and ignore them in future formulas. Just for space. It’s just an intercept)"
+ "section": "Why?",
+ "text": "Why?\n\nGiven \\(\\X\\), find a projection \\(\\mathbf{P}\\) onto \\(\\R^M\\) with \\(M \\leq p\\) that minimizes the reconstruction error \\[\n\\begin{aligned}\n\\min_{\\mathbf{P}} &\\,\\, \\lVert \\mathbf{X} - \\mathbf{X}\\mathbf{P} \\rVert^2_F \\,\\,\\, \\textrm{(sum all the elements)}\\\\\n\\textrm{subject to} &\\,\\, \\textrm{rank}(\\mathbf{P}) = M,\\, \\mathbf{P} = \\mathbf{P}^T,\\, \\mathbf{P} = \\mathbf{P}^2\n\\end{aligned}\n\\] The conditions ensure that \\(\\mathbf{P}\\) is a projection matrix onto \\(M\\) dimensions.\nMaximize the variance explained by an orthogonal transformation \\(\\mathbf{A} \\in \\R^{p\\times M}\\) \\[\n\\begin{aligned}\n\\max_{\\mathbf{A}} &\\,\\, \\textrm{trace}\\left(\\frac{1}{n}\\mathbf{A}^\\top \\X^\\top \\X \\mathbf{A}\\right)\\\\\n\\textrm{subject to} &\\,\\, \\mathbf{A}^\\top\\mathbf{A} = \\mathbf{I}_M\n\\end{aligned}\n\\]\n\n\nIn case one, the minimizer is \\(\\mathbf{P} = \\mathbf{V}_M\\mathbf{V}_M^\\top\\)\nIn case two, the maximizer is \\(\\mathbf{A} = \\mathbf{V}_M\\)."
},
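To make the two formulations concrete, here is a small simulated check (my own sketch): projecting onto the top-M right singular vectors gives a smaller Frobenius reconstruction error than projecting onto trailing directions.

```r
set.seed(2)
X <- scale(matrix(rnorm(200 * 5), ncol = 5), scale = FALSE)  # centered
V <- svd(X)$v
M <- 2
P_top  <- V[, 1:M] %*% t(V[, 1:M])    # the claimed minimizer V_M V_M'
P_tail <- V[, 4:5] %*% t(V[, 4:5])    # some other rank-2 projection
c(top = sum((X - X %*% P_top)^2), tail = sum((X - X %*% P_tail)^2))
```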
{
- "objectID": "schedule/slides/21-nnets-intro.html#terminology-1",
- "href": "schedule/slides/21-nnets-intro.html#terminology-1",
+ "objectID": "schedule/slides/24-pca-intro.html#lower-dimensional-embeddings",
+ "href": "schedule/slides/24-pca-intro.html#lower-dimensional-embeddings",
"title": "UBC Stat406 2023W",
- "section": "Terminology",
- "text": "Terminology\n\\[f(x) = {\\beta_0} + \\sum_{k=1}^{{K}} {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}\\]\nNotes (no biases):\n\\(\\beta \\in \\R^k\\).\n\\(w_k \\in \\R^p,\\ k = 1,\\ldots,K\\)\n\\(\\mathbf{W} \\in \\R^{K\\times p}\\)"
+ "section": "Lower dimensional embeddings",
+ "text": "Lower dimensional embeddings\nSuppose we have predictors \\(\\x_1\\) and \\(\\x_2\\)\n\nWe more faithfully preserve the structure of this data by keeping \\(\\x_1\\) and setting \\(\\x_2\\) to zero than the opposite"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#what-about-classification-10-classes-2-layers",
- "href": "schedule/slides/21-nnets-intro.html#what-about-classification-10-classes-2-layers",
+ "objectID": "schedule/slides/24-pca-intro.html#lower-dimensional-embeddings-1",
+ "href": "schedule/slides/24-pca-intro.html#lower-dimensional-embeddings-1",
"title": "UBC Stat406 2023W",
- "section": "What about classification (10 classes, 2 layers)",
- "text": "What about classification (10 classes, 2 layers)\n\n\n\\[\n\\begin{aligned}\nA_k^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_\\ell^{(2)} &= g\\left(\\sum_{k=1}^{K_1} w^{(2)}_{\\ell,k} A_k^{(1)} \\right)\\\\\nz_m &= \\sum_{\\ell=1}^{K_2} \\beta_{m,\\ell} A_\\ell^{(2)}\\\\\nf_m(x) &= \\frac{1}{1 + \\exp(-z_m)}\\\\\n\\end{aligned}\n\\]\n\n\n\n\n\n\n\n\n\nPredict class with largest probability: \\(\\hat{Y} = \\argmax_{m} f_m(x)\\)"
+ "section": "Lower dimensional embeddings",
+ "text": "Lower dimensional embeddings\nAn important feature of the previous example is that \\(\\x_1\\) and \\(\\x_2\\) aren’t correlated\nWhat if they are?\n\nWe lose a lot of structure by setting either \\(\\x_1\\) or \\(\\x_2\\) to zero"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#what-about-classification-10-classes-2-layers-1",
- "href": "schedule/slides/21-nnets-intro.html#what-about-classification-10-classes-2-layers-1",
+ "objectID": "schedule/slides/24-pca-intro.html#lower-dimensional-embeddings-2",
+ "href": "schedule/slides/24-pca-intro.html#lower-dimensional-embeddings-2",
"title": "UBC Stat406 2023W",
- "section": "What about classification (10 classes, 2 layers)",
- "text": "What about classification (10 classes, 2 layers)\n\n\nNotes:\n\\(B \\in \\R^{M\\times K_2}\\) (here \\(M=10\\)).\n\\(\\mathbf{W}_2 \\in \\R^{K_2\\times K_1}\\)\n\\(\\mathbf{W}_1 \\in \\R^{K_1\\times p}\\)"
+ "section": "Lower dimensional embeddings",
+ "text": "Lower dimensional embeddings\nThe only difference is the first is a rotation of the second"
},
{
- "objectID": "schedule/slides/21-nnets-intro.html#two-observations",
- "href": "schedule/slides/21-nnets-intro.html#two-observations",
+ "objectID": "schedule/slides/24-pca-intro.html#pca-1",
+ "href": "schedule/slides/24-pca-intro.html#pca-1",
"title": "UBC Stat406 2023W",
- "section": "Two observations",
- "text": "Two observations\n\nThe \\(g\\) function generates a feature map\n\nWe start with \\(p\\) covariates and we generate \\(K\\) features (1-layer)\n\n\nLogistic / Least-squares with a polynomial transformation\n\\[\n\\begin{aligned}\n&\\Phi(x) \\\\\n& =\n(1, x_1, \\ldots, x_p, x_1^2,\\ldots,x_p^2,\\ldots\\\\\n& \\quad \\ldots x_1x_2, \\ldots, x_{p-1}x_p) \\\\\n& =\n(\\phi_1(x),\\ldots,\\phi_{K_2}(x))\\\\\nf(x) &= \\sum_{k=1}^{K_2} \\beta_k \\phi_k(x) = \\beta^\\top \\Phi(x)\n\\end{aligned}\n\\]\n\n\nNeural network\n\\[\\begin{aligned}\nA_k &= g\\left( \\sum_{j=1}^p w_{kj}x_j\\right) = g\\left( w_{k}^{\\top}x\\right)\\\\\n\\Phi(x) &= (A_1,\\ldots, A_K)^\\top \\in \\mathbb{R}^{K}\\\\\nf(x) &=\\beta^{\\top} \\Phi(x)=\\beta^\\top A\\\\\n&= \\sum_{k=1}^K \\beta_k g\\left( \\sum_{j=1}^p w_{kj}x_j\\right)\\end{aligned}\\]"
+ "section": "PCA",
+ "text": "PCA\nIf we knew how to rotate our data, then we could more easily retain the structure.\nPCA gives us exactly this rotation\n\nCenter (+scale?) the data matrix \\(\\X\\)\nCompute the SVD of \\(\\X = \\U\\D \\V^\\top\\) or \\(\\X\\X^\\top = \\U\\D^2\\U^\\top\\) or \\(\\X^\\top \\X = \\V\\D^2 \\V^\\top\\)\nReturn \\(\\U_M\\D_M\\), where \\(\\D_M\\) is the largest \\(M\\) eigenvalues of \\(\\X\\)"
},
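The three steps above, written out in R against a built-in data set (a sketch; the scores from prcomp() agree up to sign flips per component):

```r
X <- scale(as.matrix(mtcars[, 1:5]))            # 1. center (+ scale)
s <- svd(X)                                      # 2. SVD
M <- 2
scores <- s$u[, 1:M] %*% diag(s$d[1:M])          # 3. U_M D_M
max(abs(abs(scores) - abs(prcomp(X)$x[, 1:M])))  # ~0; signs may differ
```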
{
- "objectID": "schedule/slides/21-nnets-intro.html#two-observations-1",
- "href": "schedule/slides/21-nnets-intro.html#two-observations-1",
+ "objectID": "schedule/slides/24-pca-intro.html#pca-2",
+ "href": "schedule/slides/24-pca-intro.html#pca-2",
"title": "UBC Stat406 2023W",
- "section": "Two observations",
- "text": "Two observations\n\nIf \\(g(u) = u\\), (or \\(=3u\\)) then neural networks reduce to (massively underdetermined) ordinary least squares (try to show this)\n\n\nReLU is the current fashion (used to be tanh or logistic)"
+ "section": "PCA",
+ "text": "PCA\n\n\nCode\ns <- svd(X)\ntib <- rbind(X, s$u %*% diag(s$d), s$u %*% diag(c(s$d[1], 0)))\ntib <- tibble(\n x1 = tib[,1], x2 = tib[,2], \n name = rep(1:3, each = 20)\n)\nplotter <- function(set = 1, main = \"original\") {\n tib |>\n filter(name == set) |>\n ggplot(aes(x1, x2)) +\n geom_point(colour = blue) +\n coord_cartesian(c(-2, 2), c(-2, 2)) +\n theme(legend.title = element_blank(), legend.position = \"bottom\") +\n ggtitle(main)\n}\ncowplot::plot_grid(\n plotter() + labs(x = bquote(x[1]), y = bquote(x[2])), \n plotter(2, \"rotated\") + \n labs(x = bquote((UD)[1] == (XV)[1]), y = bquote((UD)[2] == (XV)[2])),\n plotter(3, \"rotated and projected\") + \n labs(x = bquote(U[1]~D[1] == (XV)[1]), y = bquote(U[2]~D[2] %==% 0)),\n nrow = 1\n)"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#meta-lecture",
- "href": "schedule/slides/23-nnets-other.html#meta-lecture",
+ "objectID": "schedule/slides/24-pca-intro.html#pca-on-some-pop-music-data",
+ "href": "schedule/slides/24-pca-intro.html#pca-on-some-pop-music-data",
"title": "UBC Stat406 2023W",
- "section": "23 Neural nets - other considerations",
- "text": "23 Neural nets - other considerations\nStat 406\nDaniel J. McDonald\nLast modified – 12 October 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\]"
+ "section": "PCA on some pop music data",
+ "text": "PCA on some pop music data\n\nmusic <- bind_rows(Stat406::popmusic_test, Stat406::popmusic_train)\nstr(music)\n\ntibble [1,694 × 15] (S3: tbl_df/tbl/data.frame)\n $ artist : Factor w/ 3 levels \"Radiohead\",\"Taylor Swift\",..: 2 2 2 2 2 2 2 2 2 2 ...\n $ danceability : num [1:1694] 0.694 0.7 0.334 0.483 0.65 0.722 0.654 0.696 0.571 0.681 ...\n $ energy : num [1:1694] 0.38 0.55 0.161 0.84 0.404 0.494 0.638 0.485 0.477 0.396 ...\n $ key : int [1:1694] 2 7 0 7 7 7 8 7 11 0 ...\n $ loudness : num [1:1694] -10.31 -9.13 -14.88 -6.51 -8.4 ...\n $ mode : int [1:1694] 1 1 1 1 1 1 1 1 0 1 ...\n $ speechiness : num [1:1694] 0.0614 0.0653 0.0506 0.119 0.0356 0.204 0.075 0.123 0.246 0.0487 ...\n $ acousticness : num [1:1694] 0.416 0.0661 0.967 0.43 0.0616 0.216 0.0727 0.103 0.313 0.487 ...\n $ instrumentalness: num [1:1694] 8.47e-06 1.02e-04 4.71e-05 5.75e-04 0.00 0.00 0.00 6.69e-05 0.00 1.38e-03 ...\n $ liveness : num [1:1694] 0.126 0.091 0.115 0.146 0.104 0.226 0.497 0.136 0.11 0.117 ...\n $ valence : num [1:1694] 0.376 0.412 0.396 0.55 0.0379 0.131 0.0941 0.339 0.331 0.157 ...\n $ tempo : num [1:1694] 120 164 177 158 108 ...\n $ time_signature : int [1:1694] 4 4 4 4 4 4 4 4 4 4 ...\n $ duration_ms : int [1:1694] 194206 194165 188496 260361 218270 210556 204852 196258 148781 225194 ...\n $ explicit : logi [1:1694] FALSE FALSE FALSE FALSE FALSE FALSE ...\n\nX <- music |> select(danceability:valence)\npca <- prcomp(X, scale = TRUE) ## DON'T USE princomp()"
},
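For readers following along, the pieces of a prcomp() fit live in named components; a small sketch on a built-in data set (USArrests rather than the Stat406 pop music data):

```r
pca <- prcomp(USArrests, scale = TRUE)
head(pca$x[, 1:2])     # scores, one row per observation
pca$rotation[, 1:2]    # loadings, one row per original variable
summary(pca)           # proportion of variance explained by each PC
```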
{
- "objectID": "schedule/slides/23-nnets-other.html#estimation-procedures-training",
- "href": "schedule/slides/23-nnets-other.html#estimation-procedures-training",
+ "objectID": "schedule/slides/24-pca-intro.html#pca-on-some-pop-music-data-1",
+ "href": "schedule/slides/24-pca-intro.html#pca-on-some-pop-music-data-1",
"title": "UBC Stat406 2023W",
- "section": "Estimation procedures (training)",
- "text": "Estimation procedures (training)\nBack-propagation\nAdvantages:\n\nIt’s updates only depend on local information in the sense that if objects in the hierarchical model are unrelated to each other, the updates aren’t affected\n(This helps in many ways, most notably in parallel architectures)\nIt doesn’t require second-derivative information\nAs the updates are only in terms of \\(\\hat{R}_i\\), the algorithm can be run in either batch or online mode\n\nDown sides:\n\nIt can be very slow\nNeed to choose the learning rate \\(\\gamma_t\\)"
+ "section": "PCA on some pop music data",
+ "text": "PCA on some pop music data"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#other-algorithms",
- "href": "schedule/slides/23-nnets-other.html#other-algorithms",
+ "objectID": "schedule/slides/26-pca-v-kpca.html#meta-lecture",
+ "href": "schedule/slides/26-pca-v-kpca.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "Other algorithms",
- "text": "Other algorithms\nThere are many variations on the fitting algorithm\nStochastic gradient descent: (SGD) discussed in the optimization lecture\nThe rest are variations that use lots of tricks\n\nRMSprop\nAdam\nAdadelta\nAdagrad\nAdamax\nNadam\nFtrl"
+ "section": "26 PCA v KPCA",
+ "text": "26 PCA v KPCA\nStat 406\nDaniel J. McDonald\nLast modified – 01 November 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\]"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#regularizing-neural-networks",
- "href": "schedule/slides/23-nnets-other.html#regularizing-neural-networks",
+ "objectID": "schedule/slides/26-pca-v-kpca.html#pca-v-kpca",
+ "href": "schedule/slides/26-pca-v-kpca.html#pca-v-kpca",
"title": "UBC Stat406 2023W",
- "section": "Regularizing neural networks",
- "text": "Regularizing neural networks\nNNets can almost always achieve 0 training error. Even with regularization. Because they have so many parameters.\nFlavours:\n\na complexity penalization term \\(\\longrightarrow\\) solve \\(\\min \\hat{R} + \\rho(\\alpha,\\beta)\\)\nearly stopping on the back propagation algorithm used for fitting\n\n\nWeight decay\n\nThis is like ridge regression in that we penalize the squared Euclidean norm of the weights \\(\\rho(\\mathbf{W},\\mathbf{B}) = \\sum w_i^2 + \\sum b_i^2\\)\n\nWeight elimination\n\nThis encourages more shrinking of small weights \\(\\rho(\\mathbf{W},\\mathbf{B}) = \\sum \\frac{w_i^2}{1+w_i^2} + \\sum \\frac{b_i^2}{1 + b_i^2}\\) or Lasso-type\n\nDropout\n\nIn each epoch, randomly choose \\(z\\%\\) of the nodes and set those weights to zero."
+ "section": "PCA v KPCA",
+ "text": "PCA v KPCA\n(We assume \\(\\X\\) is already centered/scaled, \\(n\\) rows, \\(p\\) columns)\n\n\nPCA:\n\nStart with data.\nDecompose \\(\\X=\\U\\D\\V^\\top\\) (SVD).\nEmbed into \\(M\\leq p\\) dimensions: \\[\\U_M \\D_M = \\X\\V_M\\]\n\nThe “embedding” is \\(\\U_M \\D_M\\).\n(called the “Principal Components” or the “scores” or occasionally the “factors”)\nThe “loadings” or “weights” are \\(\\V_M\\)\n\n\nKPCA:\n\nChoose \\(k(x_i, x_{i'})\\). Create \\(\\mathbf{K}\\).\nDouble center \\(\\mathbf{K} = \\mathbf{PKP}\\).\nDecompose \\(\\mathbf{K} = \\U \\D^2 \\U^\\top\\) (eigendecomposition).\nEmbed into \\(M\\leq p\\) dimensions: \\[\\U_M \\D_M\\]\n\nThe “embedding” is \\(\\U_M \\D_M\\).\nThere are no “loadings”\n(\\(\\not\\exists\\ \\mathbf{B}\\) such that \\(\\X\\mathbf{B} = \\U_M \\D_M\\))"
},
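A minimal KPCA sketch following the right-hand column above, with a radial kernel; the data and the bandwidth gamma are made up for illustration:

```r
set.seed(3)
X <- matrix(rnorm(60 * 2), ncol = 2)
n <- nrow(X); M <- 2; gamma <- 1
K <- exp(-as.matrix(dist(X))^2 / gamma)       # K_{ii'} = k(x_i, x_i')
P <- diag(n) - matrix(1, n, n) / n            # P = I_n - 11'/n
K <- P %*% K %*% P                            # double centering
e <- eigen(K, symmetric = TRUE)               # K = U D^2 U'
kpca_scores <- e$vectors[, 1:M] %*% diag(sqrt(e$values[1:M]))  # U_M D_M
```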
{
- "objectID": "schedule/slides/23-nnets-other.html#other-common-pitfalls",
- "href": "schedule/slides/23-nnets-other.html#other-common-pitfalls",
+ "objectID": "schedule/slides/26-pca-v-kpca.html#why-is-this-the-solution",
+ "href": "schedule/slides/26-pca-v-kpca.html#why-is-this-the-solution",
"title": "UBC Stat406 2023W",
- "section": "Other common pitfalls",
- "text": "Other common pitfalls\nThere are a few areas to watch out for\nNonconvexity:\nThe neural network optimization problem is non-convex.\nThis makes any numerical solution highly dependent on the initial values. These should be\n\nchosen carefully, typically random near 0. DON’T use all 0.\nregenerated several times to check sensitivity\n\nScaling:\nBe sure to standardize the covariates before training"
+ "section": "Why is this the solution?",
+ "text": "Why is this the solution?\nThe “maximize variance” version of PCA:\n\\[\\max_\\alpha \\Var{\\X\\alpha} \\quad \\textrm{ subject to } \\quad \\left|\\left| \\alpha \\right|\\right|_2^2 = 1\\]\n( \\(\\Var{\\X\\alpha} = \\alpha^\\top\\X^\\top\\X\\alpha\\) )\nThis is equivalent to solving (Lagrangian):\n\\[\\max_\\alpha \\alpha^\\top\\X^\\top\\X\\alpha - \\lambda\\left|\\left| \\alpha \\right|\\right|_2^2\\]\nTake derivative wrt \\(\\alpha\\) and set to 0:\n\\[0 = 2\\X^\\top\\X\\alpha - 2\\lambda\\alpha\\]\nThis is the equation for an eigenproblem. The solution is \\(\\alpha=\\V_1\\) and the maximum is \\(\\D_1^2\\)."
},
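A numerical sanity check of the eigenproblem argument (simulated data): the leading eigenvector of X'X matches the first right singular vector of X up to sign, and the top eigenvalue equals D_1^2.

```r
set.seed(5)
X <- scale(matrix(rnorm(100 * 4), ncol = 4), scale = FALSE)
s <- svd(X)
e <- eigen(crossprod(X), symmetric = TRUE)
max(abs(abs(e$vectors[, 1]) - abs(s$v[, 1])))  # same direction up to sign
c(e$values[1], s$d[1]^2)                       # same maximum value
```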
{
- "objectID": "schedule/slides/23-nnets-other.html#other-common-pitfalls-1",
- "href": "schedule/slides/23-nnets-other.html#other-common-pitfalls-1",
+ "objectID": "schedule/slides/26-pca-v-kpca.html#example-not-real-unless-theres-code",
+ "href": "schedule/slides/26-pca-v-kpca.html#example-not-real-unless-theres-code",
"title": "UBC Stat406 2023W",
- "section": "Other common pitfalls",
- "text": "Other common pitfalls\nNumber of hidden units:\nIt is generally better to have too many hidden units than too few (regularization can eliminate some).\nSifting the output:\n\nChoose the solution that minimizes training error\nChoose the solution that minimizes the penalized training error\nAverage the solutions across runs"
+ "section": "Example (not real unless there’s code)",
+ "text": "Example (not real unless there’s code)\n\ndata(\"mobility\", package = \"Stat406\")\nX <- mobility %>%\n select(Black:Married) %>%\n as.matrix()\nnot_missing <- X %>% complete.cases()\nX <- scale(X[not_missing, ], center = TRUE, scale = TRUE)\ncolors <- mobility$Mobility[not_missing]\nM <- 2 # embedding dimension\nP <- diag(nrow(X)) - 1 / nrow(X)\n\n\n\nPCA: (all 3 are equivalent)\n\ns <- svd(X) # use svd\npca_loadings <- s$v[, 1:M]\npca_scores <- X %*% pca_loadings\n\n\ns <- eigen(t(X) %*% X) # V D^2 V'\npca_loadings <- s$vectors[, 1:M]\npca_scores <- X %*% pca_loadings\n\n\ns <- eigen(X %*% t(X)) # U D^2 U'\nD <- sqrt(diag(s$values[1:M]))\nU <- s$vectors[, 1:M]\npca_scores <- U %*% D\npca_loadings <- (1 / D) %*% t(U) %*% X\n\n\n\nKPCA:\n\nd <- 2\nK <- P %*% (1 + X %*% t(X))^d %*% P # polynomial\ne <- eigen(K) # U D^2 U'\n# (different from the PCA one, K /= XX')\nU <- e$vectors[, 1:M]\nD <- diag(sqrt(e$values[1:M]))\nkpca_poly <- U %*% D\n\n\nK <- P %*% tanh(1 + X %*% t(X)) %*% P # sigmoid kernel\ne <- eigen(K) # U D^2 U'\n# (different from the PCA one, K /= XX')\nU <- e$vectors[, 1:M]\nD <- diag(sqrt(e$values[1:M]))\nkpca_sigmoid <- U %*% D"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#tuning-parameters",
- "href": "schedule/slides/23-nnets-other.html#tuning-parameters",
+ "objectID": "schedule/slides/26-pca-v-kpca.html#plotting",
+ "href": "schedule/slides/26-pca-v-kpca.html#plotting",
"title": "UBC Stat406 2023W",
- "section": "Tuning parameters",
- "text": "Tuning parameters\nThere are many.\n\nRegularization\nStopping criterion\nlearning rate\nArchitecture\nDropout %\nothers…\n\nThese are hard to tune.\nIn practice, people might choose “some” with a validation set, and fix the rest largely arbitrarily\n\nMore often, people set them all arbitrarily"
+ "section": "Plotting",
+ "text": "Plotting"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#thoughts-on-nnets",
- "href": "schedule/slides/23-nnets-other.html#thoughts-on-nnets",
+ "objectID": "schedule/slides/26-pca-v-kpca.html#pca-loadings",
+ "href": "schedule/slides/26-pca-v-kpca.html#pca-loadings",
"title": "UBC Stat406 2023W",
- "section": "Thoughts on NNets",
- "text": "Thoughts on NNets\nOff the top of my head, without lots of justification\n\n\n🤬😡 Why don’t statisticians like them? 🤬😡\n\nThere is little theory (though this is increasing)\nStat theory applies to global minima, here, only local determined by the optimizer\nLittle understanding of when they work\nIn large part, NNets look like logistic regression + feature creation. We understand that well, and in many applications, it performs as well\nExplosion of tuning parameters without a way to decide\nRequire massive datasets to work\nLots of examples where they perform exceedingly poorly\n\n\n\n🔥🔥Why are they hot?🔥🔥\n\nPerform exceptionally well on typical CS tasks (images, translation)\nTake advantage of SOTA computing (parallel, GPUs)\nVery good for multinomial logistic regression\nAn excellent example of “transfer learning”\nThey generate pretty pictures (the nets, pseudo-responses at hidden units)"
+ "section": "PCA loadings",
+ "text": "PCA loadings\nShowing the first 10 PCA loadings:\n\nFirst column are the weights on the first score\neach number corresponds to a variable in the original data\nHow much does that variable contribute to that score?\n\n\nhead(round(pca_loadings, 2), 10)\n\n [,1] [,2]\n [1,] 0.25 0.07\n [2,] 0.13 -0.14\n [3,] 0.17 -0.34\n [4,] 0.18 -0.33\n [5,] 0.16 -0.34\n [6,] -0.24 0.11\n [7,] -0.04 -0.35\n [8,] 0.28 0.00\n [9,] 0.13 -0.14\n[10,] 0.29 0.10"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#keras",
- "href": "schedule/slides/23-nnets-other.html#keras",
+ "objectID": "schedule/slides/26-pca-v-kpca.html#kpca-feature-map-version",
+ "href": "schedule/slides/26-pca-v-kpca.html#kpca-feature-map-version",
"title": "UBC Stat406 2023W",
- "section": "Keras",
- "text": "Keras\nMost people who do deep learning use Python \\(+\\) Keras \\(+\\) Tensorflow\nIt takes some work to get all this software up and running.\nIt is possible to do in with R using an interface to Keras.\n\nI used to try to do a walk-through, but the interface is quite brittle\nIf you want to explore, see the handout:\n\nKnitted: https://ubc-stat.github.io/stat-406-lectures/handouts/keras-nnet.html\nRmd: https://ubc-stat.github.io/stat-406-lectures/handouts/keras-nnet.Rmd"
+ "section": "KPCA, feature map version",
+ "text": "KPCA, feature map version\n\np <- ncol(X)\nscX <- scale(X)\nwidth <- p * (p - 1) / 2 + p # ~630\nZ <- matrix(NA, nrow(X), width)\nk <- 0\nfor (i in 1:p) {\n for (j in i:p) {\n k <- k + 1\n Z[, k] <- X[, i] * X[, j]\n }\n}\nwideX <- scale(cbind(X, Z))\ns <- RSpectra::svds(wideX, 2) # the whole svd would be super slow\nfkpca_scores <- s$u %*% diag(s$d)\n\n\nUnfortunately, can’t easily compare to check whether the result is the same\nAlso can cause numerical issues\nBut should be the “same” (assuming I didn’t screw up…)\nWould also allow me to get the loadings, though they’d depend on polynomials\n\n\n\nUBC Stat 406 - 2023"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#section",
- "href": "schedule/slides/23-nnets-other.html#section",
+ "objectID": "schedule/slides/28-hclust.html#meta-lecture",
+ "href": "schedule/slides/28-hclust.html#meta-lecture",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "The Bias-Variance Trade-Off & \"DOUBLE DESCENT\" 🧵Remember the bias-variance trade-off? It says that models perform well for an \"intermediate level of flexibility\". You've seen the picture of the U-shape test error curve.We try to hit the \"sweet spot\" of flexibility.1/🧵 pic.twitter.com/HPk05izkZh— Daniela Witten (@daniela_witten) August 9, 2020"
+ "section": "28 Hierarchical clustering",
+ "text": "28 Hierarchical clustering\nStat 406\nDaniel J. McDonald\nLast modified – 01 November 2023\n\\[\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n\\]"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#where-does-this-u-shape-come-from",
- "href": "schedule/slides/23-nnets-other.html#where-does-this-u-shape-come-from",
+ "objectID": "schedule/slides/28-hclust.html#from-k-means-to-hierarchical-clustering",
+ "href": "schedule/slides/28-hclust.html#from-k-means-to-hierarchical-clustering",
"title": "UBC Stat406 2023W",
- "section": "Where does this U shape come from?",
- "text": "Where does this U shape come from?\nMSE = Squared Bias + Variance + Irreducible Noise\nAs we increase flexibility:\n\nSquared bias goes down\nVariance goes up\nEventually, | \\(\\partial\\) Variance | \\(>\\) | \\(\\partial\\) Squared Bias |.\n\nGoal: Choose amount of flexibility to balance these and minimize MSE.\n\nUse CV or something to estimate MSE and decide how much flexibility."
+ "section": "From \\(K\\)-means to hierarchical clustering",
+ "text": "From \\(K\\)-means to hierarchical clustering\n\n\nK-means\n\nIt fits exactly \\(K\\) clusters.\nFinal clustering assignments depend on the chosen initial cluster centers.\n\nHierarchical clustering\n\nNo need to choose the number of clusters before hand.\nThere is no random component (nor choice of starting point).\n\nThere is a catch: we need to choose a way to measure the distance between clusters, called the linkage.\n\n\nSame data as the K-means example:\n\n\nCode\n# same data as K-means \"Dumb example\"\nheatmaply::ggheatmap(\n as.matrix(dist(rbind(X1, X2, X3))),\n showticklabels = c(FALSE, FALSE), hide_colorbar = TRUE\n)"
},
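In R, the two choices named above (the dissimilarity and the linkage) are the only inputs to hclust(); a toy sketch with simulated blobs, not the slide data:

```r
set.seed(406)
X <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))
hc <- hclust(dist(X), method = "average")  # linkage goes in `method`
plot(hc)                                   # dendrogram; no random component
```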
{
- "objectID": "schedule/slides/23-nnets-other.html#section-1",
- "href": "schedule/slides/23-nnets-other.html#section-1",
+ "objectID": "schedule/slides/28-hclust.html#hierarchical-clustering",
+ "href": "schedule/slides/28-hclust.html#hierarchical-clustering",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "In the past few yrs, (and particularly in the context of deep learning) ppl have noticed \"double descent\" -- when you continue to fit increasingly flexible models that interpolate the training data, then the test error can start to DECREASE again!! Check it out: 3/ pic.twitter.com/Vo54tRVRNG— Daniela Witten (@daniela_witten) August 9, 2020"
+ "section": "Hierarchical clustering",
+ "text": "Hierarchical clustering\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGiven the linkage, hierarchical clustering produces a sequence of clustering assignments.\nAt one end, all points are in their own cluster.\nAt the other, all points are in one cluster.\nIn the middle, there are nontrivial solutions."
},
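One fitted tree contains the whole sequence of assignments described above; cutree() recovers any point along it (toy data for illustration):

```r
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)           # 10 points
hc <- hclust(dist(X), method = "complete")
cutree(hc, k = nrow(X))  # one end: every point is its own cluster
cutree(hc, k = 3)        # a nontrivial solution in the middle
cutree(hc, k = 1)        # other end: everything in one cluster
```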
{
- "objectID": "schedule/slides/23-nnets-other.html#zero-training-error-and-model-saturation",
- "href": "schedule/slides/23-nnets-other.html#zero-training-error-and-model-saturation",
+ "objectID": "schedule/slides/28-hclust.html#agglomeration",
+ "href": "schedule/slides/28-hclust.html#agglomeration",
"title": "UBC Stat406 2023W",
- "section": "Zero training error and model saturation",
- "text": "Zero training error and model saturation\n\nIn Deep Learning, the recommendation is to “fit until you get zero training error”\nThis somehow magically, leads to a continued decrease in test error.\nSo, who cares about the Bias-Variance Trade off!!\n\n\nLesson:\nBV Trade off is not wrong. 😢\nThis is a misunderstanding of black box algorithms and flexibility.\nWe don’t even need deep learning to illustrate."
+ "section": "Agglomeration",
+ "text": "Agglomeration\n\n\n\n\n\n\n\n\n\n\n\n\nGiven these data points, an agglomerative algorithm chooses a cluster sequence by combining the points into groups.\nWe can also represent the sequence of clustering assignments as a dendrogram\nCutting the dendrogram horizontally partitions the data points into clusters\n\n\n\n\nNotation: Define \\(x_1,\\ldots, x_n\\) to be the data\nLet the dissimiliarities be \\(d_{ij}\\) between each pair \\(x_i, x_j\\)\nAt any level, clustering assignments can be expressed by sets \\(G = \\{ i_1, i_2, \\ldots, i_r\\}\\) giving the indicies of points in this group. Define \\(|G|\\) to be the size of \\(G\\).\n\n\nLinkage\n\nThe function \\(d(G,H)\\) that takes two groups \\(G,\\ H\\) and returns the linkage distance between them."
},
{
- "objectID": "schedule/slides/23-nnets-other.html#section-2",
- "href": "schedule/slides/23-nnets-other.html#section-2",
+ "objectID": "schedule/slides/28-hclust.html#agglomerative-clustering-given-the-linkage",
+ "href": "schedule/slides/28-hclust.html#agglomerative-clustering-given-the-linkage",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "library(splines)\nset.seed(20221102)\nn <- 20\ndf <- tibble(\n x = seq(-1.5 * pi, 1.5 * pi, length.out = n),\n y = sin(x) + runif(n, -0.5, 0.5)\n)\ng <- ggplot(df, aes(x, y)) + geom_point() + stat_function(fun = sin) + ylim(c(-2, 2))\ng + stat_smooth(method = lm, formula = y ~ bs(x, df = 4), se = FALSE, color = green) + # too smooth\n stat_smooth(method = lm, formula = y ~ bs(x, df = 8), se = FALSE, color = orange) # looks good"
+ "section": "Agglomerative clustering, given the linkage",
+ "text": "Agglomerative clustering, given the linkage\n\nStart with each point in its own group\nUntil there is only one cluster, repeatedly merge the two groups \\(G,H\\) that minimize \\(d(G,H)\\).\n\n\n\n\n\n\n\nImportant\n\n\n\\(d\\) measures the distance between GROUPS."
},
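The greedy merge loop described above is short enough to write out directly. This is an illustration only (toy data, with single linkage assumed as d(G, H)), not how hclust() is implemented internally:

```r
set.seed(406)
x <- matrix(rnorm(12), ncol = 2)           # 6 points
d <- as.matrix(dist(x))                    # dissimilarities d_ij
groups <- as.list(seq_len(nrow(x)))        # start: each point in its own group
linkage <- function(G, H) min(d[G, H])     # d(G, H), here single linkage

while (length(groups) > 1) {
  pairs <- t(combn(length(groups), 2))
  dists <- apply(pairs, 1, function(p) linkage(groups[[p[1]]], groups[[p[2]]]))
  best <- pairs[which.min(dists), ]        # the two groups minimizing d(G, H)
  groups[[best[1]]] <- c(groups[[best[1]]], groups[[best[2]]])
  groups <- groups[-best[2]]
  cat(length(groups), "groups left; merged at height", round(min(dists), 3), "\n")
}
```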
{
- "objectID": "schedule/slides/23-nnets-other.html#section-3",
- "href": "schedule/slides/23-nnets-other.html#section-3",
+ "objectID": "schedule/slides/28-hclust.html#single-linkage",
+ "href": "schedule/slides/28-hclust.html#single-linkage",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "xn <- seq(-1.5 * pi, 1.5 * pi, length.out = 1000)\n# Spline by hand\nX <- bs(df$x, df = 20, intercept = TRUE)\nXn <- bs(xn, df = 20, intercept = TRUE)\nS <- svd(X)\nyhat <- Xn %*% S$v %*% diag(1/S$d) %*% crossprod(S$u, df$y)\ng + geom_line(data = tibble(x=xn, y=yhat), colour = orange) +\n ggtitle(\"20 degrees of freedom\")"
+ "section": "Single linkage",
+ "text": "Single linkage\nIn single linkage (a.k.a nearest-neighbor linkage), the linkage distance between \\(G,\\ H\\) is the smallest dissimilarity between two points in different groups: \\[d_{\\textrm{single}}(G,H) = \\min_{i \\in G, \\, j \\in H} d_{ij}\\]"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#section-4",
- "href": "schedule/slides/23-nnets-other.html#section-4",
+ "objectID": "schedule/slides/28-hclust.html#complete-linkage",
+ "href": "schedule/slides/28-hclust.html#complete-linkage",
"title": "UBC Stat406 2023W",
- "section": "",
- "text": "xn <- seq(-1.5 * pi, 1.5 * pi, length.out = 1000)\n# Spline by hand\nX <- bs(df$x, df = 40, intercept = TRUE)\nXn <- bs(xn, df = 40, intercept = TRUE)\nS <- svd(X)\nyhat <- Xn %*% S$v %*% diag(1/S$d) %*% crossprod(S$u, df$y)\ng + geom_line(data = tibble(x = xn, y = yhat), colour = orange) +\n ggtitle(\"40 degrees of freedom\")"
+ "section": "Complete linkage",
+ "text": "Complete linkage\nIn complete linkage (i.e. farthest-neighbor linkage), linkage distance between \\(G,H\\) is the largest dissimilarity between two points in different clusters: \\[d_{\\textrm{complete}}(G,H) = \\max_{i \\in G,\\, j \\in H} d_{ij}.\\]"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#what-happened",
- "href": "schedule/slides/23-nnets-other.html#what-happened",
+ "objectID": "schedule/slides/28-hclust.html#average-linkage",
+ "href": "schedule/slides/28-hclust.html#average-linkage",
"title": "UBC Stat406 2023W",
- "section": "What happened?!",
- "text": "What happened?!\n\ndoffs <- 4:50\nmse <- function(x, y) mean((x - y)^2)\nget_errs <- function(doff) {\n X <- bs(df$x, df = doff, intercept = TRUE)\n Xn <- bs(xn, df = doff, intercept = TRUE)\n S <- svd(X)\n yh <- S$u %*% crossprod(S$u, df$y)\n bhat <- S$v %*% diag(1 / S$d) %*% crossprod(S$u, df$y)\n yhat <- Xn %*% S$v %*% diag(1 / S$d) %*% crossprod(S$u, df$y)\n nb <- sqrt(sum(bhat^2))\n tibble(train = mse(df$y, yh), test = mse(yhat, sin(xn)), norm = nb)\n}\nerrs <- map(doffs, get_errs) |>\n list_rbind() |> \n mutate(`degrees of freedom` = doffs) |> \n pivot_longer(train:test, values_to = \"error\")"
+ "section": "Average linkage",
+ "text": "Average linkage\nIn average linkage, the linkage distance between \\(G,H\\) is the average dissimilarity over all points in different clusters: \\[d_{\\textrm{average}}(G,H) = \\frac{1}{|G| \\cdot |H| }\\sum_{i \\in G, \\,j \\in H} d_{ij}.\\]"
},
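Since the three preceding definitions all reduce to a summary of the cross-group dissimilarities, each is one line in R; a toy sketch with hand-picked groups G and H (made up for illustration):

```r
set.seed(7)
x <- matrix(rnorm(20), ncol = 2)   # 10 points
d <- as.matrix(dist(x))
G <- 1:4; H <- 5:10
c(single   = min(d[G, H]),         # smallest cross-group dissimilarity
  complete = max(d[G, H]),         # largest
  average  = mean(d[G, H]))        # sum / (|G| * |H|)
```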
{
- "objectID": "schedule/slides/23-nnets-other.html#what-happened-1",
- "href": "schedule/slides/23-nnets-other.html#what-happened-1",
+ "objectID": "schedule/slides/28-hclust.html#common-properties",
+ "href": "schedule/slides/28-hclust.html#common-properties",
"title": "UBC Stat406 2023W",
- "section": "What happened?!",
- "text": "What happened?!\n\n\nCode\nggplot(errs, aes(`degrees of freedom`, error, color = name)) +\n geom_line(linewidth = 2) + \n coord_cartesian(ylim = c(0, .12)) +\n scale_x_log10() + \n scale_colour_manual(values = c(blue, orange), name = \"\") +\n geom_vline(xintercept = 20)"
+ "section": "Common properties",
+ "text": "Common properties\nSingle, complete, and average linkage share the following:\n\nThey all operate on the dissimilarities \\(d_{ij}\\).\nThis means that the points we are clustering can be quite general (number of mutations on a genome, polygons, faces, whatever).\nRunning agglomerative clustering with any of these linkages produces a dendrogram with no inversions\n“No inversions” means that the linkage distance between merged clusters only increases as we run the algorithm.\n\nIn other words, we can draw a proper dendrogram, where the height of a parent is always higher than the height of either daughter.\n(We’ll return to this again shortly)"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#what-happened-2",
- "href": "schedule/slides/23-nnets-other.html#what-happened-2",
+ "objectID": "schedule/slides/28-hclust.html#centroid-linkage",
+ "href": "schedule/slides/28-hclust.html#centroid-linkage",
"title": "UBC Stat406 2023W",
- "section": "What happened?!",
- "text": "What happened?!\n\n\nCode\nbest_test <- errs |> filter(name == \"test\")\nmin_norm <- best_test$norm[which.min(best_test$error)]\nggplot(best_test, aes(norm, error)) +\n geom_line(colour = blue, size = 2) + ylab(\"test error\") +\n geom_vline(xintercept = min_norm, colour = orange) +\n scale_y_log10() + scale_x_log10() + geom_vline(xintercept = 20)"
+ "section": "Centroid linkage",
+ "text": "Centroid linkage\nCentroid linkage is relatively new. We need \\(x_i \\in \\mathbb{R}^p\\).\n\\(\\overline{x}_G\\) and \\(\\overline{x}_H\\) are group averages\n\\(d_{\\textrm{centroid}} = ||\\overline{x}_G - \\overline{x}_H||_2^2\\)"
},
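The centroid linkage formula, written out for the same kind of toy groups (x, G, H are made up for illustration):

```r
set.seed(8)
x <- matrix(rnorm(20), ncol = 2)
G <- 1:4; H <- 5:10
sum((colMeans(x[G, ]) - colMeans(x[H, ]))^2)  # || xbar_G - xbar_H ||_2^2
```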
{
- "objectID": "schedule/slides/23-nnets-other.html#degrees-of-freedom-and-complexity",
- "href": "schedule/slides/23-nnets-other.html#degrees-of-freedom-and-complexity",
+ "objectID": "schedule/slides/28-hclust.html#centroid-linkage-1",
+ "href": "schedule/slides/28-hclust.html#centroid-linkage-1",
"title": "UBC Stat406 2023W",
- "section": "Degrees of freedom and complexity",
- "text": "Degrees of freedom and complexity\n\nIn low dimensions (where \\(n \\gg p\\)), with linear smoothers, df and model complexity are roughly the same.\nBut this relationship breaks down in more complicated settings\nWe’ve already seen this:\n\n\nlibrary(glmnet)\nout <- cv.glmnet(X, df$y, nfolds = n) # leave one out\n\n\n\nCode\nwith(\n out, \n tibble(lambda = lambda, df = nzero, cv = cvm, cvup = cvup, cvlo = cvlo )\n) |> \n filter(df > 0) |>\n pivot_longer(lambda:df) |> \n ggplot(aes(x = value)) +\n geom_errorbar(aes(ymax = cvup, ymin = cvlo)) +\n geom_point(aes(y = cv), colour = orange) +\n facet_wrap(~ name, strip.position = \"bottom\", scales = \"free_x\") +\n scale_y_log10() +\n scale_x_log10() + theme(axis.title.x = element_blank())"
+ "section": "Centroid linkage",
+ "text": "Centroid linkage\n\n\nCentroid linkage is\n\n… quite intuitive\n… nicely analogous to \\(K\\)-means.\n… very related to average linkage (and much, much faster)\n\nHowever, it may introduce inversions.\n\n\n\n\n\n\n\n\n\n\n\n\n\nCode\ntt <- seq(0, 2 * pi, len = 50)\ntt2 <- seq(0, 2 * pi, len = 75)\nc1 <- tibble(x = cos(tt), y = sin(tt))\nc2 <- tibble(x = 1.5 * cos(tt2), y = 1.5 * sin(tt2))\ncircles <- bind_rows(c1, c2)\ndi <- dist(circles[, 1:2])\nhc <- hclust(di, method = \"centroid\")\npar(mar = c(.1, 5, 3, .1))\nplot(hc, xlab = \"\")"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#infinite-solutions",
- "href": "schedule/slides/23-nnets-other.html#infinite-solutions",
+ "objectID": "schedule/slides/28-hclust.html#shortcomings-of-some-linkages",
+ "href": "schedule/slides/28-hclust.html#shortcomings-of-some-linkages",
"title": "UBC Stat406 2023W",
- "section": "Infinite solutions",
- "text": "Infinite solutions\n\nIn Lasso, df is not really the right measure of complexity\nBetter is \\(\\lambda\\) or the norm of the coefficients (these are basically the same)\nSo what happened with the Splines?\n\n\n\nWhen df \\(= 20\\), there’s a unique solution that interpolates the data\nWhen df \\(> 20\\), there are infinitely many solutions that interpolate the data.\n\nBecause we used the SVD to solve the system, we happened to pick one: the one that has the smallest \\(\\Vert\\hat\\beta\\Vert_2\\)\nRecent work in Deep Learning shows that SGD has the same property: it returns the local optima with the smallest norm.\nIf we measure complexity in terms of the norm of the weights, rather than by counting parameters, we don’t see double descent anymore."
+ "section": "Shortcomings of some linkages",
+ "text": "Shortcomings of some linkages\n\nSingle\n\n👎 chaining — a single pair of close points merges two clusters. \\(\\Rightarrow\\) clusters can be too spread out, not compact\n\nComplete linkage\n\n👎 crowding — a point can be closer to points in other clusters than to points in its own cluster.\\(\\Rightarrow\\) clusters are compact, not far enough apart.\n\nAverage linkage\n\ntries to strike a balance these\n\n\n👎 Unclear what properties the resulting clusters have when we cut an average linkage tree.\n\n\n👎 Results change with a monotone increasing transformation of the dissimilarities\n\nCentroid linkage\n\n👎 same monotonicity problem\n\n\n👎 and inversions\n\nAll linkages\n\n⁇ where do we cut?"
},
{
- "objectID": "schedule/slides/23-nnets-other.html#the-lesson",
- "href": "schedule/slides/23-nnets-other.html#the-lesson",
+ "objectID": "schedule/slides/28-hclust.html#distances",
+ "href": "schedule/slides/28-hclust.html#distances",
"title": "UBC Stat406 2023W",
- "section": "The lesson",
- "text": "The lesson\n\nDeep learning isn’t magic.\nZero training error with lots of parameters doesn’t mean good test error.\nWe still need the bias variance tradeoff\nIt’s intuition still applies: more flexibility eventually leads to increased MSE\nBut we need to be careful how we measure complexity.\n\n\n\nThere is very interesting recent theory that says when we can expect lower test error to the right of the interpolation threshold than to the left."
+ "section": "Distances",
+ "text": "Distances\nNote how all the methods depend on the distance function\nCan do lots of things besides Euclidean\nThis is very important"
}
]
\ No newline at end of file
diff --git a/site_libs/revealjs/plugin/multiplex/multiplex.js b/site_libs/revealjs/plugin/multiplex/multiplex.js
new file mode 100644
index 0000000..c15414e
--- /dev/null
+++ b/site_libs/revealjs/plugin/multiplex/multiplex.js
@@ -0,0 +1,57 @@
+(function() {
+
+ // emulate async script load
+ window.addEventListener( 'load', function() {
+ var multiplex = Reveal.getConfig().multiplex;
+ var socketId = multiplex.id;
+ var socket = io.connect(multiplex.url);
+
+ function post( evt ) {
+ var messageData = {
+ state: Reveal.getState(),
+ secret: multiplex.secret,
+ socketId: multiplex.id,
+ content: (evt || {}).content
+ };
+ socket.emit( 'multiplex-statechanged', messageData );
+ };
+
+ // master
+ if (multiplex.secret !== null) {
+
+ // Don't emit events from inside of notes windows
+ if ( window.location.search.match( /receiver/gi ) ) { return; }
+
+ // post once the page is loaded, so the client follows also on "open URL".
+ post();
+
+ // Monitor events that trigger a change in state
+ Reveal.on( 'slidechanged', post );
+ Reveal.on( 'fragmentshown', post );
+ Reveal.on( 'fragmenthidden', post );
+ Reveal.on( 'overviewhidden', post );
+ Reveal.on( 'overviewshown', post );
+ Reveal.on( 'paused', post );
+ Reveal.on( 'resumed', post );
+ document.addEventListener( 'send', post ); // broadcast custom events sent by other plugins
+
+ // client
+ } else {
+ socket.on(multiplex.id, function(message) {
+ // ignore data from sockets that aren't ours
+ if (message.socketId !== socketId) { return; }
+ if( window.location.host === 'localhost:1947' ) return;
+
+ if ( message.state ) {
+ Reveal.setState(message.state);
+ }
+ if ( message.content ) {
+ // forward custom events to other plugins
+ var event = new CustomEvent('received');
+ event.content = message.content;
+ document.dispatchEvent( event );
+ }
+ });
+ }
+ });
+}());
\ No newline at end of file
diff --git a/site_libs/revealjs/plugin/multiplex/plugin.yml b/site_libs/revealjs/plugin/multiplex/plugin.yml
new file mode 100644
index 0000000..9ccda63
--- /dev/null
+++ b/site_libs/revealjs/plugin/multiplex/plugin.yml
@@ -0,0 +1,8 @@
+name: multiplex
+script: [socket.io.js, multiplex.js]
+register: false
+config:
+ multiplex:
+ secret: null
+ id: null
+ url: "https://reveal-multiplex.glitch.me/"
diff --git a/site_libs/revealjs/plugin/multiplex/socket.io.js b/site_libs/revealjs/plugin/multiplex/socket.io.js
new file mode 100644
index 0000000..270777b
--- /dev/null
+++ b/site_libs/revealjs/plugin/multiplex/socket.io.js
@@ -0,0 +1,9 @@
+/*!
+ * Socket.IO v2.3.0
+ * (c) 2014-2019 Guillermo Rauch
+ * Released under the MIT License.
+ */
+!function(t,e){"object"==typeof exports&&"object"==typeof module?module.exports=e():"function"==typeof define&&define.amd?define([],e):"object"==typeof exports?exports.io=e():t.io=e()}(this,function(){return function(t){function e(r){if(n[r])return n[r].exports;var o=n[r]={exports:{},id:r,loaded:!1};return t[r].call(o.exports,o,o.exports,e),o.loaded=!0,o.exports}var n={};return e.m=t,e.c=n,e.p="",e(0)}([function(t,e,n){function r(t,e){"object"==typeof t&&(e=t,t=void 0),e=e||{};var n,r=o(t),i=r.source,u=r.id,p=r.path,h=c[u]&&p in c[u].nsps,f=e.forceNew||e["force new connection"]||!1===e.multiplex||h;return f?(a("ignoring socket cache for %s",i),n=s(i,e)):(c[u]||(a("new io instance for %s",i),c[u]=s(i,e)),n=c[u]),r.query&&!e.query&&(e.query=r.query),n.socket(r.path,e)}var o=n(1),i=n(7),s=n(15),a=n(3)("socket.io-client");t.exports=e=r;var c=e.managers={};e.protocol=i.protocol,e.connect=r,e.Manager=n(15),e.Socket=n(39)},function(t,e,n){function r(t,e){var n=t;e=e||"undefined"!=typeof location&&location,null==t&&(t=e.protocol+"//"+e.host),"string"==typeof t&&("/"===t.charAt(0)&&(t="/"===t.charAt(1)?e.protocol+t:e.host+t),/^(https?|wss?):\/\//.test(t)||(i("protocol-less url %s",t),t="undefined"!=typeof e?e.protocol+"//"+t:"https://"+t),i("parse %s",t),n=o(t)),n.port||(/^(http|ws)$/.test(n.protocol)?n.port="80":/^(http|ws)s$/.test(n.protocol)&&(n.port="443")),n.path=n.path||"/";var r=n.host.indexOf(":")!==-1,s=r?"["+n.host+"]":n.host;return n.id=n.protocol+"://"+s+":"+n.port,n.href=n.protocol+"://"+s+(e&&e.port===n.port?"":":"+n.port),n}var o=n(2),i=n(3)("socket.io-client:url");t.exports=r},function(t,e){var n=/^(?:(?![^:@]+:[^:@\/]*@)(http|https|ws|wss):\/\/)?((?:(([^:@]*)(?::([^:@]*))?)?@)?((?:[a-f0-9]{0,4}:){2,7}[a-f0-9]{0,4}|[^:\/?#]*)(?::(\d*))?)(((\/(?:[^?#](?![^?#\/]*\.[^?#\/.]+(?:[?#]|$)))*\/?)?([^?#\/]*))(?:\?([^#]*))?(?:#(.*))?)/,r=["source","protocol","authority","userInfo","user","password","host","port","relative","path","directory","file","query","anchor"];t.exports=function(t){var e=t,o=t.indexOf("["),i=t.indexOf("]");o!=-1&&i!=-1&&(t=t.substring(0,o)+t.substring(o,i).replace(/:/g,";")+t.substring(i,t.length));for(var s=n.exec(t||""),a={},c=14;c--;)a[r[c]]=s[c]||"";return o!=-1&&i!=-1&&(a.source=e,a.host=a.host.substring(1,a.host.length-1).replace(/;/g,":"),a.authority=a.authority.replace("[","").replace("]","").replace(/;/g,":"),a.ipv6uri=!0),a}},function(t,e,n){(function(r){"use strict";function o(){return!("undefined"==typeof window||!window.process||"renderer"!==window.process.type&&!window.process.__nwjs)||("undefined"==typeof navigator||!navigator.userAgent||!navigator.userAgent.toLowerCase().match(/(edge|trident)\/(\d+)/))&&("undefined"!=typeof document&&document.documentElement&&document.documentElement.style&&document.documentElement.style.WebkitAppearance||"undefined"!=typeof window&&window.console&&(window.console.firebug||window.console.exception&&window.console.table)||"undefined"!=typeof navigator&&navigator.userAgent&&navigator.userAgent.toLowerCase().match(/firefox\/(\d+)/)&&parseInt(RegExp.$1,10)>=31||"undefined"!=typeof navigator&&navigator.userAgent&&navigator.userAgent.toLowerCase().match(/applewebkit\/(\d+)/))}function i(e){if(e[0]=(this.useColors?"%c":"")+this.namespace+(this.useColors?" 
%c":" ")+e[0]+(this.useColors?"%c ":" ")+"+"+t.exports.humanize(this.diff),this.useColors){var n="color: "+this.color;e.splice(1,0,n,"color: inherit");var r=0,o=0;e[0].replace(/%[a-zA-Z%]/g,function(t){"%%"!==t&&(r++,"%c"===t&&(o=r))}),e.splice(o,0,n)}}function s(){var t;return"object"===("undefined"==typeof console?"undefined":p(console))&&console.log&&(t=console).log.apply(t,arguments)}function a(t){try{t?e.storage.setItem("debug",t):e.storage.removeItem("debug")}catch(n){}}function c(){var t=void 0;try{t=e.storage.getItem("debug")}catch(n){}return!t&&"undefined"!=typeof r&&"env"in r&&(t=r.env.DEBUG),t}function u(){try{return localStorage}catch(t){}}var p="function"==typeof Symbol&&"symbol"==typeof Symbol.iterator?function(t){return typeof t}:function(t){return t&&"function"==typeof Symbol&&t.constructor===Symbol&&t!==Symbol.prototype?"symbol":typeof t};e.log=s,e.formatArgs=i,e.save=a,e.load=c,e.useColors=o,e.storage=u(),e.colors=["#0000CC","#0000FF","#0033CC","#0033FF","#0066CC","#0066FF","#0099CC","#0099FF","#00CC00","#00CC33","#00CC66","#00CC99","#00CCCC","#00CCFF","#3300CC","#3300FF","#3333CC","#3333FF","#3366CC","#3366FF","#3399CC","#3399FF","#33CC00","#33CC33","#33CC66","#33CC99","#33CCCC","#33CCFF","#6600CC","#6600FF","#6633CC","#6633FF","#66CC00","#66CC33","#9900CC","#9900FF","#9933CC","#9933FF","#99CC00","#99CC33","#CC0000","#CC0033","#CC0066","#CC0099","#CC00CC","#CC00FF","#CC3300","#CC3333","#CC3366","#CC3399","#CC33CC","#CC33FF","#CC6600","#CC6633","#CC9900","#CC9933","#CCCC00","#CCCC33","#FF0000","#FF0033","#FF0066","#FF0099","#FF00CC","#FF00FF","#FF3300","#FF3333","#FF3366","#FF3399","#FF33CC","#FF33FF","#FF6600","#FF6633","#FF9900","#FF9933","#FFCC00","#FFCC33"],t.exports=n(5)(e);var h=t.exports.formatters;h.j=function(t){try{return JSON.stringify(t)}catch(e){return"[UnexpectedJSONParseError]: "+e.message}}}).call(e,n(4))},function(t,e){function n(){throw new Error("setTimeout has not been defined")}function r(){throw new Error("clearTimeout has not been defined")}function o(t){if(p===setTimeout)return setTimeout(t,0);if((p===n||!p)&&setTimeout)return p=setTimeout,setTimeout(t,0);try{return p(t,0)}catch(e){try{return p.call(null,t,0)}catch(e){return p.call(this,t,0)}}}function i(t){if(h===clearTimeout)return clearTimeout(t);if((h===r||!h)&&clearTimeout)return h=clearTimeout,clearTimeout(t);try{return h(t)}catch(e){try{return h.call(null,t)}catch(e){return h.call(this,t)}}}function s(){y&&l&&(y=!1,l.length?d=l.concat(d):m=-1,d.length&&a())}function a(){if(!y){var t=o(s);y=!0;for(var e=d.length;e;){for(l=d,d=[];++m1)for(var n=1;n100)){var e=/^(-?(?:\d+)?\.?\d+) *(milliseconds?|msecs?|ms|seconds?|secs?|s|minutes?|mins?|m|hours?|hrs?|h|days?|d|weeks?|w|years?|yrs?|y)?$/i.exec(t);if(e){var n=parseFloat(e[1]),r=(e[2]||"ms").toLowerCase();switch(r){case"years":case"year":case"yrs":case"yr":case"y":return n*h;case"weeks":case"week":case"w":return n*p;case"days":case"day":case"d":return n*u;case"hours":case"hour":case"hrs":case"hr":case"h":return n*c;case"minutes":case"minute":case"mins":case"min":case"m":return n*a;case"seconds":case"second":case"secs":case"sec":case"s":return n*s;case"milliseconds":case"millisecond":case"msecs":case"msec":case"ms":return n;default:return}}}}function r(t){var e=Math.abs(t);return e>=u?Math.round(t/u)+"d":e>=c?Math.round(t/c)+"h":e>=a?Math.round(t/a)+"m":e>=s?Math.round(t/s)+"s":t+"ms"}function o(t){var e=Math.abs(t);return e>=u?i(t,e,u,"day"):e>=c?i(t,e,c,"hour"):e>=a?i(t,e,a,"minute"):e>=s?i(t,e,s,"second"):t+" ms"}function i(t,e,n,r){var 
o=e>=1.5*n;return Math.round(t/n)+" "+r+(o?"s":"")}var s=1e3,a=60*s,c=60*a,u=24*c,p=7*u,h=365.25*u;t.exports=function(t,e){e=e||{};var i=typeof t;if("string"===i&&t.length>0)return n(t);if("number"===i&&isFinite(t))return e["long"]?o(t):r(t);throw new Error("val is not a non-empty string or a valid number. val="+JSON.stringify(t))}},function(t,e,n){function r(){}function o(t){var n=""+t.type;if(e.BINARY_EVENT!==t.type&&e.BINARY_ACK!==t.type||(n+=t.attachments+"-"),t.nsp&&"/"!==t.nsp&&(n+=t.nsp+","),null!=t.id&&(n+=t.id),null!=t.data){var r=i(t.data);if(r===!1)return g;n+=r}return f("encoded %j as %s",t,n),n}function i(t){try{return JSON.stringify(t)}catch(e){return!1}}function s(t,e){function n(t){var n=d.deconstructPacket(t),r=o(n.packet),i=n.buffers;i.unshift(r),e(i)}d.removeBlobs(t,n)}function a(){this.reconstructor=null}function c(t){var n=0,r={type:Number(t.charAt(0))};if(null==e.types[r.type])return h("unknown packet type "+r.type);if(e.BINARY_EVENT===r.type||e.BINARY_ACK===r.type){for(var o="";"-"!==t.charAt(++n)&&(o+=t.charAt(n),n!=t.length););if(o!=Number(o)||"-"!==t.charAt(n))throw new Error("Illegal attachments");r.attachments=Number(o)}if("/"===t.charAt(n+1))for(r.nsp="";++n;){var i=t.charAt(n);if(","===i)break;if(r.nsp+=i,n===t.length)break}else r.nsp="/";var s=t.charAt(n+1);if(""!==s&&Number(s)==s){for(r.id="";++n;){var i=t.charAt(n);if(null==i||Number(i)!=i){--n;break}if(r.id+=t.charAt(n),n===t.length)break}r.id=Number(r.id)}if(t.charAt(++n)){var a=u(t.substr(n)),c=a!==!1&&(r.type===e.ERROR||y(a));if(!c)return h("invalid payload");r.data=a}return f("decoded %s as %j",t,r),r}function u(t){try{return JSON.parse(t)}catch(e){return!1}}function p(t){this.reconPack=t,this.buffers=[]}function h(t){return{type:e.ERROR,data:"parser error: "+t}}var f=n(8)("socket.io-parser"),l=n(11),d=n(12),y=n(13),m=n(14);e.protocol=4,e.types=["CONNECT","DISCONNECT","EVENT","ACK","ERROR","BINARY_EVENT","BINARY_ACK"],e.CONNECT=0,e.DISCONNECT=1,e.EVENT=2,e.ACK=3,e.ERROR=4,e.BINARY_EVENT=5,e.BINARY_ACK=6,e.Encoder=r,e.Decoder=a;var g=e.ERROR+'"encode error"';r.prototype.encode=function(t,n){if(f("encoding packet %j",t),e.BINARY_EVENT===t.type||e.BINARY_ACK===t.type)s(t,n);else{var r=o(t);n([r])}},l(a.prototype),a.prototype.add=function(t){var n;if("string"==typeof t)n=c(t),e.BINARY_EVENT===n.type||e.BINARY_ACK===n.type?(this.reconstructor=new p(n),0===this.reconstructor.reconPack.attachments&&this.emit("decoded",n)):this.emit("decoded",n);else{if(!m(t)&&!t.base64)throw new Error("Unknown type: "+t);if(!this.reconstructor)throw new Error("got binary data when not reconstructing a packet");n=this.reconstructor.takeBinaryData(t),n&&(this.reconstructor=null,this.emit("decoded",n))}},a.prototype.destroy=function(){this.reconstructor&&this.reconstructor.finishedReconstruction()},p.prototype.takeBinaryData=function(t){if(this.buffers.push(t),this.buffers.length===this.reconPack.attachments){var e=d.reconstructPacket(this.reconPack,this.buffers);return this.finishedReconstruction(),e}return null},p.prototype.finishedReconstruction=function(){this.reconPack=null,this.buffers=[]}},function(t,e,n){(function(r){"use strict";function o(){return!("undefined"==typeof window||!window.process||"renderer"!==window.process.type)||("undefined"==typeof navigator||!navigator.userAgent||!navigator.userAgent.toLowerCase().match(/(edge|trident)\/(\d+)/))&&("undefined"!=typeof document&&document.documentElement&&document.documentElement.style&&document.documentElement.style.WebkitAppearance||"undefined"!=typeof 
diff --git a/sitemap.xml b/sitemap.xml
index bbe3039..9e9137e 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -2,166 +2,186 @@
https://github.com/UBC-STAT/stat-406/schedule/handouts/lab00-git.html
- 2023-10-30T17:41:01.448Z
+ 2023-11-02T23:23:30.405Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/22-nnets-estimation.html
- 2023-10-30T17:40:59.100Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/27-kmeans.html
+ 2023-11-02T23:23:28.201Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/20-boosting.html
- 2023-10-30T17:40:57.956Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/25-pca-issues.html
+ 2023-11-02T23:23:26.873Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/18-the-bootstrap.html
- 2023-10-30T17:40:56.436Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/23-nnets-other.html
+ 2023-11-02T23:23:25.557Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/16-logistic-regression.html
- 2023-10-30T17:40:54.748Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/21-nnets-intro.html
+ 2023-11-02T23:23:24.045Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/14-classification-intro.html
- 2023-10-30T17:40:53.376Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/19-bagging-and-rf.html
+ 2023-11-02T23:23:22.825Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/12-why-smooth.html
- 2023-10-30T17:40:51.824Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/17-nonlinear-classifiers.html
+ 2023-11-02T23:23:21.265Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/10-basis-expansions.html
- 2023-10-30T17:40:50.640Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/15-LDA-and-QDA.html
+ 2023-11-02T23:23:19.765Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/08-ridge-regression.html
- 2023-10-30T17:40:49.084Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/13-gams-trees.html
+ 2023-11-02T23:23:18.349Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/06-information-criteria.html
- 2023-10-30T17:40:47.432Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/11-kernel-smoothers.html
+ 2023-11-02T23:23:17.173Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/04-bias-variance.html
- 2023-10-30T17:40:45.832Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/09-l1-penalties.html
+ 2023-11-02T23:23:15.708Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/02-lm-example.html
- 2023-10-30T17:40:44.224Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/07-greedy-selection.html
+ 2023-11-02T23:23:14.032Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/00-version-control.html
- 2023-10-30T17:40:42.920Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/05-estimating-test-mse.html
+ 2023-11-02T23:23:12.500Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/00-quiz-0-wrap.html
- 2023-10-30T17:40:40.188Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/03-regression-function.html
+ 2023-11-02T23:23:10.860Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/00-gradient-descent.html
- 2023-10-30T17:40:38.876Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/01-lm-review.html
+ 2023-11-02T23:23:09.332Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/00-classification-losses.html
- 2023-10-30T17:40:36.448Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/00-r-review.html
+ 2023-11-02T23:23:07.596Z
- https://github.com/UBC-STAT/stat-406/course-setup.html
- 2023-10-30T17:40:33.627Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/00-intro-to-class.html
+ 2023-11-02T23:23:05.440Z
- https://github.com/UBC-STAT/stat-406/computing/windows.html
- 2023-10-30T17:40:31.655Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/00-cv-for-many-models.html
+ 2023-11-02T23:23:03.976Z
- https://github.com/UBC-STAT/stat-406/computing/mac_x86.html
- 2023-10-30T17:40:29.975Z
+ https://github.com/UBC-STAT/stat-406/schedule/index.html
+ 2023-11-02T23:23:01.612Z
- https://github.com/UBC-STAT/stat-406/computing/index.html
- 2023-10-30T17:40:28.383Z
+ https://github.com/UBC-STAT/stat-406/syllabus.html
+ 2023-11-02T23:22:59.916Z
- https://github.com/UBC-STAT/stat-406/index.html
- 2023-10-30T17:40:26.495Z
+ https://github.com/UBC-STAT/stat-406/computing/ubuntu.html
+ 2023-11-02T23:22:57.748Z
+
+
+ https://github.com/UBC-STAT/stat-406/computing/mac_arm.html
+ 2023-11-02T23:22:56.168Z
https://github.com/UBC-STAT/stat-406/faq.html
- 2023-10-30T17:40:27.951Z
+ 2023-11-02T23:22:54.948Z
- https://github.com/UBC-STAT/stat-406/computing/mac_arm.html
- 2023-10-30T17:40:29.183Z
+ https://github.com/UBC-STAT/stat-406/index.html
+ 2023-11-02T23:22:53.472Z
- https://github.com/UBC-STAT/stat-406/computing/ubuntu.html
- 2023-10-30T17:40:30.731Z
+ https://github.com/UBC-STAT/stat-406/computing/index.html
+ 2023-11-02T23:22:55.376Z
- https://github.com/UBC-STAT/stat-406/syllabus.html
- 2023-10-30T17:40:32.963Z
+ https://github.com/UBC-STAT/stat-406/computing/mac_x86.html
+ 2023-11-02T23:22:56.960Z
- https://github.com/UBC-STAT/stat-406/schedule/index.html
- 2023-10-30T17:40:34.655Z
+ https://github.com/UBC-STAT/stat-406/computing/windows.html
+ 2023-11-02T23:22:58.644Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/00-cv-for-many-models.html
- 2023-10-30T17:40:38.072Z
+ https://github.com/UBC-STAT/stat-406/course-setup.html
+ 2023-11-02T23:23:00.572Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/00-intro-to-class.html
- 2023-10-30T17:40:39.568Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/00-classification-losses.html
+ 2023-11-02T23:23:02.372Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/00-r-review.html
- 2023-10-30T17:40:41.760Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/00-gradient-descent.html
+ 2023-11-02T23:23:04.760Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/01-lm-review.html
- 2023-10-30T17:40:43.572Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/00-quiz-0-wrap.html
+ 2023-11-02T23:23:06.068Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/03-regression-function.html
- 2023-10-30T17:40:45.120Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/00-version-control.html
+ 2023-11-02T23:23:08.724Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/05-estimating-test-mse.html
- 2023-10-30T17:40:46.704Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/02-lm-example.html
+ 2023-11-02T23:23:09.980Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/07-greedy-selection.html
- 2023-10-30T17:40:48.224Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/04-bias-variance.html
+ 2023-11-02T23:23:11.600Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/09-l1-penalties.html
- 2023-10-30T17:40:49.968Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/06-information-criteria.html
+ 2023-11-02T23:23:13.240Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/11-kernel-smoothers.html
- 2023-10-30T17:40:51.356Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/08-ridge-regression.html
+ 2023-11-02T23:23:14.892Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/13-gams-trees.html
- 2023-10-30T17:40:52.560Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/10-basis-expansions.html
+ 2023-11-02T23:23:16.420Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/15-LDA-and-QDA.html
- 2023-10-30T17:40:54.068Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/12-why-smooth.html
+ 2023-11-02T23:23:17.633Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/17-nonlinear-classifiers.html
- 2023-10-30T17:40:55.592Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/14-classification-intro.html
+ 2023-11-02T23:23:19.093Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/19-bagging-and-rf.html
- 2023-10-30T17:40:57.212Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/16-logistic-regression.html
+ 2023-11-02T23:23:20.457Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/21-nnets-intro.html
- 2023-10-30T17:40:58.488Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/18-the-bootstrap.html
+ 2023-11-02T23:23:22.085Z
- https://github.com/UBC-STAT/stat-406/schedule/slides/23-nnets-other.html
- 2023-10-30T17:41:00.000Z
+ https://github.com/UBC-STAT/stat-406/schedule/slides/20-boosting.html
+ 2023-11-02T23:23:23.565Z
+
+
+ https://github.com/UBC-STAT/stat-406/schedule/slides/22-nnets-estimation.html
+ 2023-11-02T23:23:24.709Z
+
+
+ https://github.com/UBC-STAT/stat-406/schedule/slides/24-pca-intro.html
+ 2023-11-02T23:23:26.221Z
+
+
+ https://github.com/UBC-STAT/stat-406/schedule/slides/26-pca-v-kpca.html
+ 2023-11-02T23:23:27.537Z
+
+
+ https://github.com/UBC-STAT/stat-406/schedule/slides/28-hclust.html
+ 2023-11-02T23:23:28.969Z