# Classification
**Learning objectives:**
- Compare and contrast **classification** with linear regression.
- Perform classification using **logistic regression**.
- Perform classification using **linear discriminant analysis (LDA)**.
- Perform classification using **quadratic discriminant analysis (QDA)**.
- Perform classification using **naive Bayes**.
- Identify the **strengths and weaknesses** of the various classification models.
- Model count data using **Poisson regression**.
## An Overview of Classification
- **Classification**: approaches for making inferences about and/or predicting a qualitative (categorical) response variable
- Some common classification techniques (classifiers):
- logistic regression
- linear discriminant analysis (LDA)
- quadratic discriminant analysis (QDA)
- naive Bayes
- K-nearest neighbors
<br>
- **Examples of classification problems:**
<br>
1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
- Predictor variable: Symptoms
- Response variable: Type of medical condition
<br>
2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
- Predictor variable: User's IP address, past transaction history, etc
- Response variable: Fraudulent activity (Yes/No)
<br>
3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
- Predictor variable: DNA sequence data
- Response variable: Presence of deleterious gene (Yes/No)
<br>
- In the following section, we are going to explore the `Default` data set. The annual income ($X_1$ = `income`) and monthly credit card balance ($X_2$ = `balance`) are used to predict whether an individual will default on his or her credit card payment.
```{r fig4-1, cache=TRUE, echo=FALSE, fig.align="center", fig.cap="The distribution of balance and income split by the binary default variable respectively; Note. Defaulters represented as orange plus sign; non-defaulters represented as blue circle"}
knitr::include_graphics("./images/fig4_1.jpg", error = FALSE)
```
## Why NOT Linear Regression?
- there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression, e.g.
$$Y = \left\{ \begin{array}{ll}
1 & \mbox{if stroke};\\
2 & \mbox{if epileptic seizure};\\
3 & \mbox{if drug overdose}.\end{array} \right.$$
- Depending on the complexity of the problem, a regression method will not provide meaningful estimates of $Pr(Y|X)$;
- There are times when a binary *qualitative* response can be modeled using the *dummy variable* approach. Example:
$$Y = \left\{ \begin{array}{ll}
0 & \mbox{if stroke};\\
1 & \mbox{if drug overdose}.\end{array} \right.$$
- in such cases, a prediction of $\hat{Y} > 0.5$ can be associated with *drug overdose*.
- The main issue is that some estimates might fall outside the [0, 1] probability interval, e.g. the left panel of fig4-2:
```{r fig4-2, cache=TRUE, echo=FALSE, fig.align="center", fig.cap="Classification using the Default data. Left: Estimated probability of default using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default(No or Yes). Right: Predicted probabilities of default using logistic regression. All probabilities lie between 0 and 1."}
knitr::include_graphics("./images/fig4_2.jpg", error = FALSE)
```
## Logistic Regression
### The Logistic Model
- **Logistic regression**: models the probability that Y belongs to a particular category, given the predictor(s) X
- Here the response Y is binary (0/1)
$$p(X) = β_0 + β_1X \space \Longrightarrow {Linear \space regression}$$
$$p (X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \space \Longrightarrow {Logistic \space function}$$
$$odds = \frac{p (X)}{1 - p (X)} = e^{\beta_{0} + \beta_{1}X} \Longrightarrow {odds \space value \space in \space [0, ∞)}$$
By logging the whole equation, we get
$$\log \biggl(\frac{p(X)}{1- p(X)}\bigg) = \beta_{0} + \beta_{1}X \Longrightarrow {log \space odds/logit}$$
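As a minimal sketch of this model in R (assuming the `Default` data from the `ISLR` package used later in these notes), `glm()` with `family = binomial` fits $p(X) = Pr(\text{default} = \text{Yes} \mid \text{balance})$:
```{r logit-default-sketch}
library(ISLR)  # provides the Default data set

# Logistic regression of default (No/Yes) on balance;
# glm() models the probability of the second factor level ("Yes")
logit_default <- glm(default ~ balance, data = Default, family = binomial)
coef(summary(logit_default))

# Predicted probabilities of default for balances of $1,000 and $2,000
predict(logit_default, newdata = data.frame(balance = c(1000, 2000)),
        type = "response")
```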
### Estimating the Regression Coefficient
To estimate the regression coefficients, we use **maximum likelihood (ML)**.
***Likelihood Function***
$$ℓ (\beta_{0}, \beta_{1}) = \prod_{i: y_{i}= 1} p (x_i) \prod_{i': y_{i'}= 0} (1- p (x_{i'})) \Longrightarrow {Likelihood \space function}$$
- The aim is to find the $\beta$ values that maximize $\ell$.
- The least squares method is a special case of maximum likelihood.
### Multiple Logistic Regression
$$\log \biggl(\frac{p(X)}{1- p(X)}\bigg) = \beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p \\ \Downarrow \\ p(X) = \frac{e^{\beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p}}{1 + e^{\beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p}}$$
```{r fig4-3, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Confounding in the Default data. Left: Default rates are shown for students (orange) and non-students (blue). The solid lines display default rate as a function of balance, while the horizontal broken lines display the overall default rates. Right: Boxplots of balance for students (orange) and non-students (blue) are shown."}
knitr::include_graphics("./images/fig4_3.jpg", error = FALSE)
```
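A short sketch of the multiple logistic regression behind this figure, again assuming the `Default` data from `ISLR`; in the book, the coefficient for `student` is positive when used alone but negative once `balance` is included, which is the confounding shown above:
```{r multi-logit-default-sketch}
library(ISLR)

# student alone vs. student together with balance and income
coef(glm(default ~ student, data = Default, family = binomial))
coef(glm(default ~ student + balance + income, data = Default, family = binomial))
```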
### Multinomial Logistic Regression
- This is used in the setting where K > 2 classes. In multinomial, we select a single class to serve as the baseline.
- However, the interpretation of the coefficients in a multinomial logistic regression model must be done with care, since it is tied to the choice of baseline.
- Alternatively, you can use _softmax_ coding, where we _treat all K classes symmetrically_ rather than selecting a baseline: we estimate coefficients for all K classes, instead of estimating coefficients for K − 1 classes.
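This chapter's `Default` response is binary, so as a hedged stand-in example, `nnet::multinom()` (from the recommended `nnet` package, an assumption on my part rather than something used in the book's labs) fits the baseline-category multinomial model to the three-class `iris` data:
```{r multinom-sketch}
library(nnet)

# Multinomial logistic regression; the first factor level ("setosa") serves
# as the baseline class, so the coefficients are log-odds of the other two
# classes relative to that baseline.
multi_fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
summary(multi_fit)$coefficients
```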
## Generative Models for Classification
**Why is logistic regression not always ideal?**
- When there is substantial separation between the two classes, the
parameter estimates for the logistic regression model are surprisingly
unstable.
- If the distribution of the predictors X is approximately normal in
each of the classes and the sample size is small, then the generative modelling may be more accurate than logistic regression.
- Generative modelling can be naturally extended to the case
of more than two response classes.
<br>
**Common notations:**
<br>
- $K \Longrightarrow$ number of response classes
- $π_k \Longrightarrow$ overall or _prior_ probability that a randomly chosen observation comes from the kth class; can be estimated from a random sample from the population
- $f_k(X) ≡ Pr(X|Y = k) \Longrightarrow$ the density function of X for an observation that comes from the kth class; requires some underlying assumptions to estimate
<br>
Bayes’ theorem states that
$$Pr(Y = k|X = x) = \frac {π_k f_k(x)}{\sum_{l=1}^{K} π_lf_l(x)}$$
- $p_k(x) = Pr(Y = k|X = x) \Longrightarrow$ the _posterior probability_ that an observation X = x belongs to the kth class; computed from $f_k(X)$
## A Comparison of Classification Methods
Each of the classifiers below uses different estimates of $f_k(x)$.
- linear discriminant analysis;
- quadratic discriminant analysis;
- naive Bayes
### Linear Discriminant Analysis for p = 1
- one predictor
- classify an observation to the class for which $p_k(x)$ is greatest
**Assumptions:**
- we assume that $f_k(x)$ is normal (Gaussian) with a class-specific mean, and
- a shared variance term across all K classes [$σ^2_1 = · · · = σ^2_K = σ^2$]
The normal density takes the form
$$f_k(x) = \frac{1}{\sqrt{2π}σ_k}\exp\Big(- \frac{1}{2σ^2_k}(x- \mu_k)^2\Big)$$
Then, the posterior probability (probability that the observation belongs to the kth class, given the predictor value for that observation) is
$$p_k(x) = \frac{π_k \frac{1}{\sqrt{2π}σ}\exp(- \frac{1}{2σ^2}(x- \mu_k)^2)}{\sum^K_{l=1} π_l \frac{1}{\sqrt{2π}σ}\exp(- \frac{1}{2σ^2}(x- \mu_l)^2)}$$
**Additional mathematical formula**
After taking the log of the above equation and rearranging, you get the following formula. In the two-class case with $π_1 = π_2$, the Bayes classifier assigns an observation to class 1 if $2x (μ_1 − μ_2) > μ_1^2 − μ_2^2$, and to class 2 otherwise.
$$δ_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(π_k) \Longrightarrow {Equation \space 4.18}$$
The Bayes decision boundary is the point for which $δ_1(x) = δ_2(x)$
$$x = \frac{μ_1^2 − μ_2^2}{2(μ_1 − μ_2)} = \frac{μ_1 + μ_2}{2}$$
```{r fig4-4, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Left: Two one-dimensional normal density functions are shown. The dashed vertical line represents the Bayes decision boundary. Right: 20 observations were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data."}
knitr::include_graphics("./images/fig4_4.jpg", error = FALSE)
```
The **linear discriminant analysis (LDA)** method approximates the Bayes classifier by plugging estimates for $π_k$, $μ_k$, and $σ^2$ into Equation 4.18.
$\hat μ_k$ is the average of all the training observations from the kth class
$$\hat{\mu}_{k} = \frac{1}{n_{k}}\sum_{i: y_{i}= k} x_{i}$$
$\hat σ^2$ is the weighted average of the sample variances for each of the K classes
$$\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k = 1}^{K} \sum_{i: y_{i}= k} (x_{i} - \hat{\mu}_{k})^2$$
Note.
n = total number of training observations,
$n_k$ = number of training observations in the kth class
$\hat π_k$ is estimated as the proportion of the training observations
that belong to the kth class.
$$\hat π_k = \frac{n_k}{n}$$
LDA classifier assigns an observation X = x to the class for which $δ_k(x)$ is largest.
$$δ_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(π_k) \Longrightarrow {Equation \space 4.18} \\ \Downarrow \\ \hat δ_k(x) = x \cdot \frac{\hat \mu_k}{\hat \sigma^2} - \frac{\hat \mu_k^2}{2\hat \sigma^2} + \log(\hat π_k)$$
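A small simulation sketch of this plug-in idea (the sample sizes, seed, and means below are arbitrary choices, not from the text): with $μ_1 = -1.25$, $μ_2 = 1.25$, $σ^2 = 1$, and equal priors, the estimated boundary $(\hat μ_1 + \hat μ_2)/2$ should land near the Bayes boundary $(μ_1 + μ_2)/2 = 0$, and `MASS::lda()` applies the same rule:
```{r lda-p1-sim}
library(MASS)

set.seed(42)
n <- 20
sim <- data.frame(
  x = c(rnorm(n, mean = -1.25), rnorm(n, mean = 1.25)),
  class = factor(rep(c("1", "2"), each = n))
)

# Plug-in estimates of mu_1 and mu_2, and the estimated decision boundary
mu_hat <- tapply(sim$x, sim$class, mean)
mean(mu_hat)   # estimated boundary; Bayes boundary is (mu_1 + mu_2)/2 = 0

# The LDA fit from MASS applies the same classification rule
lda_sim <- lda(class ~ x, data = sim)
table(predicted = predict(lda_sim)$class, actual = sim$class)
```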
### Linear Discriminant Analysis for p > 1
- multiple predictors; p > 1 predictors
- observations come from a multivariate Gaussian (or multivariate normal) distribution, with a **class-specific mean vector** and a common **covariance matrix**; $$N(μ_k,Σ)$$
**Assumptions:**
- each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors
```{r fig4-5, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Two multivariate Gaussian density functions are shown, with p = 2. Left: The two predictors are uncorrelated and it has a circular base. Var(X_1) = Var(X_2) and Cor(X_1,X_2) = 0; Right: The two variables have a correlation of 0.7 with a elliptical base"}
knitr::include_graphics("./images/fig4_5.jpg", error = FALSE)
```
The multivariate Gaussian density is defined as:
$$f(x) = \frac{1}{(2π)^{\frac{p}{2}}|Σ|^{\frac{1}{2}}}\exp\Big(-\frac{1}{2}(x - \mu)^T Σ^{−1}(x − μ)\Big)$$
Bayes classifier assigns an observation X = x to the class for which $δ_k(x)$ is largest.
$$δ_k(x) = x^T Σ^{−1}μ_k - \frac{1}{2}μ_k^T Σ^{−1} μ_k + \log π_k \Longrightarrow vector/matrix \space version \\ δ_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(π_k) \Longrightarrow {Equation \space 4.18}$$
```{r fig4-6, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "An example with three classes. The observations from each class are drawn from a multivariate Gaussian distribution with p = 2, with a class-specific mean vector and a common covariance matrix. Left: Ellipses that contain 95% of the probability for each of the three classes are shown. The dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines. The Bayes decision boundaries are once again shown as dashed lines. Overall, the LDA decision boundaries are pretty close to the Bayes decision boundaries, shown again as dashed lines. The test error rates for the Bayes and LDA classifiers are 0.0746 and 0.0770, respectively."}
knitr::include_graphics("./images/fig4_6.jpg", error = FALSE)
```
All classification models have training error rate, which can be displayed with a **confusion matrix**.
**Caveats of error rate:**
- training error rates will usually be lower than test error rates, which are the real quantity of interest. The higher the ratio of parameters _p_ to number of samples n, the more we expect this _overfitting_ to play a role.
- the trivial null classifier will achieve an error rate that is only a bit higher than the LDA training set error rate
- a binary classifier such as this one can make two types of errors (Type I and II)
- Class-specific performance _(sensitivity and specificity)_ is important in certain fields (e.g., medicine)
LDA (on the `Default` data) has low sensitivity because:
1. LDA is trying to approximate the Bayes classifier, which has the lowest
total error rate out of all classifiers
2. In the process, the Bayes classifier will yield the smallest possible total number of misclassified observations, regardless of the class from which the errors stem.
3. It also uses a threshold of 50% for the posterior probability of default in order to assign an observation to the default class
For example, lowering the threshold from 0.5 to 0.2 flags more potential defaulters, improving sensitivity at the cost of more errors among non-defaulters:
$$Pr(default = Yes|X = x) > 0.5 \quad\longrightarrow\quad Pr(default = Yes|X = x) > 0.2$$
```{r fig4-7, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "The figure illustrates the trade-off that results from modifying the threshold value for the posterior probability of default. For the Default data set, error rates are shown as a function of the threshold value for the posterior probability that is used to perform the assignment. The black solid line displays the overall error rate. The blue dashed line represents the fraction of defaulting customers that are incorrectly classified, and the orange dotted line indicates the fraction of errors among the non-defaulting customers."}
knitr::include_graphics("./images/fig4_7.jpg", error = FALSE)
```
- As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among the individuals who do not default increases. The decision on the threshold must be based on **domain knowledge** (e.g., detailed information about the costs associated with default)
- The ROC curve is a way to illustrate the two types of errors at all possible thresholds.
```{r fig4-8, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified, using a given threshold value. The false positive rate is 1-specificity: the fraction of non-defaulters that we classify incorrectly as defaulters, using that same threshold value. The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate. The dotted line represents the “no information” classifier; this is what we would expect if student status and credit card balance are not associated with probability of default."}
knitr::include_graphics("./images/fig4_8.jpg", error = FALSE)
```
An ideal ROC curve will hug the top left corner, so the larger **area under the ROC curve (AUC)**, the better the classifier.
```{r tbl4_6, cache=FALSE, echo=FALSE, fig.align="center", fig.cap="Possible results when applying a classifier or diagnostic test to a population"}
library("htmlTable")
library("magrittr")
matrix(c("True Neg. (TN)", "False Pos. (FP)", "N", "False Neg. (FN)", "True Pos. (TP)", "P", "N∗", "P∗", ""),
ncol = 3,
dimnames = list("Predicted class" = c(" − or Null", " + or Non-null", "Total"),
"True class" = c("Neg. or Null", "Pos. or Non-null", "Total"))) %>%
addHtmlTableStyle(align = "lcr") %>%
htmlTable
```
Important measures for classification and diagnostic testing (a small numeric sketch follows this list):
- **False Positive rate (FP/N)** $\Longrightarrow$ Type I error, 1−Specificity
- **True Positive rate (TP/P)** $\Longrightarrow$ 1−Type II error, power, sensitivity, recall
- **Pos. Predicted value (TP/P∗)** $\Longrightarrow$ Precision, 1−false discovery proportion
- **Neg. Predicted value (TN/N∗)**
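A quick base-R sketch of these measures; the confusion-matrix counts below are made up for illustration (they are not from the text), laid out as in the table above with rows = predicted class and columns = true class:
```{r classification-measures-sketch}
# Hypothetical confusion-matrix counts (not from the text)
TN <- 950; FP <- 50   # true class negative: N = TN + FP
FN <- 40;  TP <- 60   # true class positive: P = FN + TP

c(
  false_positive_rate = FP / (TN + FP),  # FP / N  (Type I error, 1 - specificity)
  true_positive_rate  = TP / (FN + TP),  # TP / P  (sensitivity, recall, power)
  pos_pred_value      = TP / (TP + FP),  # TP / P* (precision)
  neg_pred_value      = TN / (TN + FN)   # TN / N*
)
```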
### Quadratic Discriminant Analysis (QDA)
- Like LDA, QDA assumes that the observations from each class are drawn from a Gaussian distribution and plugs estimates for the parameters into Bayes' theorem in order to perform prediction
- QDA assumes that each class has its own covariance matrix
$$X ∼ N(μ_k,Σ_k) \Longrightarrow {Σ_k \space is \space the \space covariance \space matrix \space for \space the \space kth \space class}$$
**Bayes classifier**
$$δ_k(x) = - \frac{1}{2}(x - \mu_k)^T Σ_k^{−1}(x - \mu_k) - \frac{1}{2}\log|Σ_k| + \log(π_k) \\ \Downarrow \\ δ_k(x) = - \frac{1}{2}x^T Σ_k^{−1}x + x^T Σ_k^{−1} \mu_k - \frac{1}{2}μ_k^T Σ_k^{−1} μ_k - \frac{1}{2}\log|Σ_k| + \log π_k$$
QDA classifier involves plugging estimates for **$Σ_k$, $μ_k$, and $π_k$** into the above equation, and then assigning an observation X = x to the class for which this quantity is **largest**.
The quantity x appears as a quadratic function, hence the name.
<br>
**When is LDA preferred to QDA, or vice-versa?**
<br>
1. **Bias-variance trade-off**
<br>
- Pro LDA: LDA assumes that the K classes share a common covariance matrix, so the discriminant is linear in x and there are only $Kp$ linear coefficients to estimate. LDA is a much less flexible classifier than QDA, and so has substantially *lower variance*, which can improve prediction performance.
- Con LDA: if the assumption that the K classes share a common covariance matrix is badly off, LDA can suffer from *high bias*.
- Conclusion: use LDA when there are relatively few training observations; use QDA when the training set is very large or a common covariance matrix is untenable.
```{r fig4-9, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Left: The Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem with Σ1 = Σ2. The shading indicates the QDA decision rule. Since the Bayes decision boundary is linear, it is more accurately approximated by LDA than by QDA. Right: Details are as given in the left-hand panel, except that Σ1 ̸= Σ2. Since the Bayes decision boundary is non-linear, it is more accurately approximated by QDA than by LDA."}
knitr::include_graphics("./images/fig4_9.jpg", error = FALSE)
```
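A hedged simulation sketch of this trade-off (the class means, covariance matrices, sample sizes, and seed below are arbitrary choices, not values from the text): when the two classes have different covariance matrices, the Bayes boundary is non-linear and QDA should edge out LDA on test error.
```{r lda-vs-qda-sim}
library(MASS)

set.seed(1)
make_class <- function(n, mu, Sigma, label) {
  data.frame(mvrnorm(n, mu = mu, Sigma = Sigma), class = label)
}

# Class-specific covariance matrices (Sigma_1 != Sigma_2)
Sigma1 <- matrix(c(1,  0.5,  0.5, 1), nrow = 2)
Sigma2 <- matrix(c(1, -0.5, -0.5, 1), nrow = 2)

train <- rbind(make_class(200,  c(0, 0), Sigma1, "A"),
               make_class(200,  c(1, 1), Sigma2, "B"))
test  <- rbind(make_class(2000, c(0, 0), Sigma1, "A"),
               make_class(2000, c(1, 1), Sigma2, "B"))

# Test error rates for LDA (common covariance) vs. QDA (class-specific)
lda_err <- mean(predict(lda(class ~ X1 + X2, data = train), test)$class != test$class)
qda_err <- mean(predict(qda(class ~ X1 + X2, data = train), test)$class != test$class)
c(LDA = lda_err, QDA = qda_err)
```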
### Naive Bayes
- Estimating a p-dimensional density function is challenging; naive Bayes makes a different assumption than LDA and QDA.
- an alternative to LDA that does not assume normally distributed
predictors
$$f_k(x) = f_{k1}(x_1) × f_{k2}(x_2)×· · ·×f_{kp}(x_p),$$
where $f_{kj}$ is the density function of the jth predictor among observations in the kth class
*Within the kth class, the p predictors are independent.*
**Why is naive Bayes powerful?**
1. By assuming that the p covariates are independent within each class, we assume that there is no association between the predictors. This sidesteps the hard part of estimating a p-dimensional density function: we no longer need the *joint distribution* of the predictors, only the *marginal distribution* of each one.
2. Although the p covariates are usually not truly independent within each class, the assumption is convenient and gives pretty decent results, especially when n is small relative to p.
3. It reduces variance, though it has some bias (Bias-variance trade-off)
**Options to estimate the one-dimensional density function $f_{kj}$ using training data**
1. [For Quantitative $X_j$] -> We assume $X_j |Y = k ∼ N(μ_{jk},σ_{jk}^2)$, where within each class, the jth predictor is drawn from a (univariate) normal distribution. It is **QDA-like with diagonal class-specific covariance matrix**
2. [For Quantitative $X_j$] -> Use a *non-parametric estimate* for $f_{kj}$. First, a histogram for the within-class observations and then estimate $f_{kj}(x_j)$. Or else, use **kernel density estimator**.
3. [For Qualitative $X_j$] ->Count the proportion of training observations for the jth predictor corresponding to each class.
Note: At a fixed threshold, naive Bayes has a slightly higher overall error rate than LDA on the `Default` data, but it correctly identifies a larger fraction of the true defaulters (higher sensitivity).
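A minimal sketch of naive Bayes in R, assuming the `e1071` package (an assumption; it is not loaded elsewhere in these notes, so the chunk is not evaluated) and the `Default` data from `ISLR`; its `naiveBayes()` uses the Gaussian option above for quantitative predictors and class proportions for qualitative ones:
```{r naive-bayes-sketch, eval=FALSE}
library(ISLR)
library(e1071)  # assumed to be installed

# Gaussian naive Bayes: class-specific univariate normals for balance and
# income, class-conditional proportions for the qualitative student variable
nb_fit <- naiveBayes(default ~ student + balance + income, data = Default)

# Confusion matrix at the default 0.5 posterior-probability threshold
table(predicted = predict(nb_fit, Default), actual = Default$default)
```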
## Summary of the classification methods
### An Analytical Comparison
- **LDA** and **logistic regression** assume that the log odds of the posterior probabilities is _linear_ in x.
- **QDA** assumes that the log odds of the posterior probabilities is _quadratic_ in x.
- **LDA** is simply a restricted version of QDA with $Σ_1 = · · · = Σ_K = Σ$
- **LDA** is a special case of naive Bayes and vice-versa!
- **LDA** assumes that the features are normally distributed with a common within-class covariance matrix, and naive Bayes instead assumes _independence_ of the features.
- **Naive Bayes** can produce a more _flexible_ fit.
- **QDA** might be more accurate in settings where interactions among the predictors are important in discriminating between classes.
- **LDA > logistic regression** when the observations in each of the K classes are approximately normal.
- **K-nearest neighbors (KNN)** will be the better classifier when the decision boundary is non-linear, n is large, and p is small.
- **KNN** has low bias but large variance; as such, KNN requires a lot of observations relative to the number of predictors.
- If the decision boundary is non-linear but n is only modest (or p is not tiny), then QDA may be preferred to KNN.
- KNN does not tell us which predictors are important!
<br>
_Final note._ The choice of method depends on (1) the true distribution of the predictors in each of the K classes, and (2) the values of n and p, i.e. the bias-variance trade-off.
### An Empirical Comparison
```{r fig4-11, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Boxplots of the test error rates for each of the linear scenarios described in the main text."}
knitr::include_graphics("./images/fig4_11.jpg", error = FALSE)
```
**When Bayes decision boundary is linear,**
_Scenario 1_: Binary class response, equal observations in each class, uncorrelated predictors
_Scenario 2_: Similar to Scenario 1, but the predictors had a correlation of −0.5.
_Scenario 3_: Predictors had a negative correlation and were drawn from a t-distribution (more extreme points in the tails)
```{r fig4-12, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Boxplots of the test error rates for each of the non-linear scenarios described in the main text"}
knitr::include_graphics("./images/fig4_12.jpg", error = FALSE)
```
**When Bayes decision boundary is non-linear,**
_Scenario 4_: Normal distribution, correlation of 0.5 between the predictors in the first class, and correlation of −0.5 between the predictors in the second class.
_Scenario 5_: Normal distribution, uncorrelated predictors
_Scenario 6_: Normal distribution, different diagonal covariance matrix for each class, small n
## Generalized Linear Models
**Count data** (e.g. number of bikers per hour) is neither quantitative nor qualitative
=> neither linear regression nor the classification approaches considered so far are applicable.
## Linear regression with count data - negative values
The results of fitting a least squares regression model to the `Bikeshare` data provides some reasonable results:
* as weather progressively worsens, the number of bikers decreases (_coefficients become negative wrt baseline_)
* the coefficients associated with season and time of day match expected patterns (_lowest in winter, and highest during peak commute times_)
```{r tab4-10, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_Results for a least squares linear model fit to predict bikers in the Bikeshare data. For the qualitative variable weathersit, the baseline level corresponds to clear skies._"}
knitr::include_graphics("./images/tab4_10.jpg", error = FALSE)
```
```{r fig4-13, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_A least squares linear regression model was fit to predict bikers in the Bikeshare data set. Left: The coefficients associated with the month of the year. Bike usage is highest in the spring and fall, and lowest in the winter. Right: The coefficients associated with the hour of the day. Bike usage is highest during peak commute times, and lowest overnight._"}
knitr::include_graphics("./images/fig4_13.jpg", error = FALSE)
```
***Problem 1***: <mark>*model predicts negative numbers of bikers at times*</mark>
## Linear regression with count data - heteroscedasticity
In this example, the variance of biker numbers changes as the mean number changes:
* during worse conditions, there are few bikers, and little variation in the number of bikers
* during better conditions, there are many bikers on average, but also larger variation in the number of bikers
```{r fig4-14, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_Left: On the Bikeshare dataset, the number of bikers is displayed on the y-axis, and the hour of the day is displayed on the x-axis. For the most part, as the mean number of bikers increases, so does the variance in the number of bikers. A smoothing spline fit is shown in green. Right: The log of the number of bikers is displayed on the y-axis._"}
knitr::include_graphics("./images/fig4_14.jpg", error = FALSE)
```
***Problem 2***: <mark>*observed heteroscedasticity is a violation of linear model assumptions*</mark>
$$Y = \beta_{0} + \sum_{j=1}^p \beta_{j}X_j + \epsilon$$
where $\epsilon$ is a mean-zero error term with a constant variance
Transforming the response to $\log(Y)$ tames the heteroscedasticity, but cannot be used when the response can take on a value of 0.
Log transformation also results in challenges in interpretation:
e.g. “_a one-unit increase in $X_j$ is associated with an increase in the mean of the log of $Y$ by an amount $β_j$_”
## Problems with linear regression of count data
***Problem 1***: <mark>*model predicts negative numbers of bikers at times*</mark>
***Problem 2***: <mark>*observed heteroscedasticity is a violation of linear model assumptions*</mark>
***Problem 3***: <mark>*integer values (bikers) predicted using a continuous response $Y$*</mark>
"_[A] Poisson regression model provides a much more natural and elegant approach for this task._"
## Poisson distribution
A count response variable $Y$ (which takes on non-negative integer values) can be modeled using the **Poisson distribution**, where the probability that $Y$ takes on a given count value $k$ can be calculated as:
$Pr(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}$ for $k$ = 0, 1, 2, ...
where $\lambda$ represents both the expected value (mean) and variance of $Y$:
$\lambda = E(Y) = Var(Y)$
=> "_[I]f $Y$ follows the Poisson distribution, then the larger the mean of $Y$, the larger its variance._"
```{r fig.cap= "_Plots of Poisson Distributions with different lambda values, showing how variance increases with increasing lambda. Note all values are non-negative integer values, suitable for modelling counts, k._"}
par(mfrow = c(2,2))
lambda <- c(1:4)
k <- c(0:10)
for (lam in lambda) {
Prk <- (exp(-lam)*lam^k)/factorial(k)
plot(k, Prk, type = 'b', ylim = c(0, 0.4), main = paste("lambda =", lam))
}
```
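A quick simulation check of that mean-variance statement (the sample size and seed are arbitrary):
```{r poisson-mean-var-check}
set.seed(123)
y <- rpois(1e5, lambda = 4)           # draws from a Poisson with lambda = 4
c(mean = mean(y), variance = var(y))  # both should be close to 4
```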
## Poisson Regression Model mean (lambda)
"_[R]ather than modeling [a count response variable], $Y$, as a Poisson distribution with a fixed mean value like $\lambda$ = 5, we would like to allow the mean to vary as a function of the covariates._"
The mean $\lambda$ can be modeled as a function of the predictor variables as follows:
$\log(\lambda(X_1, ..., X_p)) = \beta_0 + \beta_1X_1 + ... + \beta_pX_p$
NB: taking the log ensures that $\lambda$ can only be non-negative.
This is equivalent to representing the mean $\lambda$ as follows:
$\lambda = \text{E}(Y) = \lambda(X_1, ..., X_p) = e^{\beta_0 + \beta_1X_1 + ... + \beta_pX_p}$
## Estimating the Poisson Regression parameters
The calculation of $\lambda$ can then be used in the formula of the Poisson Distribution, allowing the Maximum Likelihood approach to be used in estimating the parameters, $\beta_0$, $\beta_1$,..., $\beta_p$:
Poisson Distribution Formula: $Pr(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!}$ for $k$ = 0, 1, 2, ...
Maximum likelihood: $l(\beta_0, \beta_1, ..., \beta_p) = \prod_{i=1}^n\frac{e^{-\lambda(x_i)}\lambda(x_i)^{y_i}}{y_i!}$
where $\lambda(x_i) = e^{\beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip}}$
Coefficients that maximize the likelihood $l(\beta_0, \beta_1, ..., \beta_p)$ (make the observed data as likely as possible) are chosen.
## Interpreting Poisson Regression
An increase in $X_j$ by one unit is associated with a change in $E(Y) = \lambda$ by a factor of $exp(\beta_j)$
```{r tab4-11, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "_Results for Poisson regression model fit to predict bikers in the Bikeshare data. For the qualitative variable weathersit, the baseline level corresponds to clear skies._"}
knitr::include_graphics("./images/tab4_11.jpg", error = FALSE)
```
A change in weather from clear to cloudy skies is associated with a change in mean bike usage by a factor of
exp(-0.08) = 0.923
i.e. on average, only 92.3% as many people will use bikes compared to when it is clear (baseline weather).
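A hedged sketch of the fit behind this table, assuming the `Bikeshare` data from the `ISLR2` package (the data set the book uses for this example; `ISLR2` is not loaded elsewhere in these notes, so the chunk is not evaluated):
```{r poisson-bikeshare-sketch, eval=FALSE}
library(ISLR2)  # provides the Bikeshare data

# Poisson regression: log of the mean number of bikers is linear in the predictors
pois_fit <- glm(bikers ~ mnth + hr + workingday + temp + weathersit,
                data = Bikeshare, family = poisson)

# exp(beta_j) is the multiplicative change in E(Y) for a one-unit change in X_j
# (here, each weather level relative to the clear-skies baseline)
exp(coef(pois_fit)[grep("^weathersit", names(coef(pois_fit)))])
```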
## Advantages of Poisson Regression
Poisson regression has several advantages in modeling count data:
**Mean-variance relationship** We implicitly assume that mean bike usage in a given hour equals the variance of bike usage during that hour (cf. the constant variance assumed in linear regression).
**Non-negative fitted values** There are no negative predictions using the Poisson regression model.
## Generalized Linear Models
Generalized linear models (GLMs) all follow the same 'recipe' (illustrated in the sketch after this list):
* use a set of predictors $X_1$, ..., $X_p$ to predict a response $Y$
* model the response $Y$ as coming from a particular distribution
e.g. Poisson Distribution, for Poisson regression
* transform the mean of the response (via a _link function_ $\eta$) so that the transformed mean is a linear function of the predictors
e.g. for Poisson regression, $\log(\lambda(X_1, ..., X_p)) = \beta_0 + \beta_1X_1 + ... + \beta_pX_p$
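In R, the recipe corresponds to the single `glm()` interface with different `family` arguments; the sketch below is illustrative only (`y`, `x1`, `x2`, and `dat` are hypothetical placeholders):
```{r glm-recipe-sketch, eval=FALSE}
# Same recipe, different response distribution and link function
glm(y ~ x1 + x2, family = gaussian(link = "identity"), data = dat)  # linear regression
glm(y ~ x1 + x2, family = binomial(link = "logit"),    data = dat)  # logistic regression
glm(y ~ x1 + x2, family = poisson(link = "log"),       data = dat)  # Poisson regression
```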
## Addendum - Logistic Regression Assumptions
```{r}
library(dplyr)
library(titanic)
library(car)
```
Source: [The 6 Assumptions of Logistic Regression (With Examples)](https://www.statology.org/assumptions-of-logistic-regression/)
Source: [Assumptions of Logistic Regression, Clearly Explained](https://towardsdatascience.com/assumptions-of-logistic-regression-clearly-explained-44d85a22b290)
**Logistic regression** is a method to fit a regression model usually when the response variable is binary.
#### Assumption #1 - The response variable is binary
Examples:
- Yes or No
- Male or Female
- Pass or Fail
For more than two possible outcomes, a multinomial or ordinal regression is the model of choice.
#### Assumption #2 - Observations are independent
As with OLS regression, logistic regression requires that the observations are independent and identically distributed (iid).
The easiest check is to plot the residuals against time (i.e. the order of observations) and see whether the pattern looks random.
#### Assumption #3 - No multicollinearity among predictors
Multicollinearity occurs when two or more explanatory variables are highly correlated to each other, such that they do not provide unique or independent information in the regression model. If the degree of correlation is high enough between variables, it can cause problems when fitting and interpreting the model.
Use variance inflation factors (`car::vif()`) to check multicollinearity (values > 10 indicate strong collinearity among predictors).
#### Assumption #4 - No extreme outliers
Logistic regression assumes that there are no extreme outliers or influential observations in the dataset.
Compute Cook's distance (`cooks.distance()`) for each observation to flag influential points.
#### Assumption #5 - There is a Linear Relationship Between Explanatory Variables and the Logit of the Response Variable
Logistic regression assumes that there exists a linear relationship between each explanatory variable and the logit of the response variable. Recall that the logit is defined as:
$\text{logit}(p) = \log(p / (1-p))$, where $p$ is the probability of a positive outcome.
Use the Box-Tidwell test (`car::boxTidwell()`) to check this assumption.
Example:
```{r}
titanic <- titanic_train %>%
select(Survived, Age, Fare) %>%
na.omit() %>%
janitor::clean_names()
glimpse(titanic)
```
Build the model
```{r}
# survived (target ~ age + fare)
log_reg <- glm(survived ~ age + fare, data = titanic, family = binomial(link = "logit"))
summary(log_reg)
```
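Before moving to assumption #5, the fitted model above can also be used for quick checks of assumptions #3 and #4; a small sketch using the `car` and base functions already loaded:
```{r titanic-assumption-checks}
# Assumption #3: multicollinearity among predictors (VIF > 10 is a red flag)
vif(log_reg)

# Assumption #4: extreme outliers / influential observations
head(sort(cooks.distance(log_reg), decreasing = TRUE))
plot(log_reg, which = 4)   # Cook's distance plot
```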
Box-Tidwell test
```{r}
# Shift age and fare by +1 so that all values are strictly positive,
# as required by the Box-Tidwell transformation test
titanic <- titanic %>%
mutate(age_1 = age + 1, fare_1 = fare + 1)
boxTidwell(survived ~ age_1 + fare_1, data = titanic)
```
#### Assumption #6 - Sample size must be sufficiently large
Logistic regression assumes that the sample size of the dataset is large enough to draw valid conclusions from the fitted logistic regression model.
As a rule of thumb, you should have a minimum of 10 cases with the least frequent outcome for each explanatory variable. For example, if you have 3 explanatory variables and the expected probability of the least frequent outcome is 0.20, then you should have a sample size of at least (10*3) / 0.20 = 150.
## Lab: Classification Methods
## Exercises
<!--
adapted from https://onmee.github.io/assets/docs/ISLR/Classification.pdf
-->
### Conceptual
1. Claim: The logistic function representation for logistic regression
$$p (X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \space \Longrightarrow {Logistic \space function}$$
is equivalent to the logit function representation for logistic regression.
$$\log \biggl(\frac{p(X)}{1- p(X)}\bigg) = \beta_{0} + \beta_{1}X \Longrightarrow {log \space odds/logit}$$
Proof:
$$\begin{array}{rcl}
p(X) & = & \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \\
p(X)[1 + e^{\beta_{0} + \beta_{1}X}] & = & e^{\beta_{0} + \beta_{1}X} \\
p(X) + p(X)e^{\beta_{0} + \beta_{1}X} & = & e^{\beta_{0} + \beta_{1}X} \\
p(X) & = & e^{\beta_{0} + \beta_{1}X} - p(X)e^{\beta_{0} + \beta_{1}X} \\
p(X) & = & e^{\beta_{0} + \beta_{1}X}[1 - p(X)] \\
\frac{p(X)}{1 - p(X)} & = & e^{\beta_{0} + \beta_{1}X} \\
\log \biggl(\frac{p(X)}{1- p(X)}\bigg) & = & \beta_{0} + \beta_{1}X \\
\end{array}$$
2. Under the assumption that the observations in the $k^{th}$ class are drawn from a Gaussian $N(\mu_{k}, \sigma^{2})$ distribution,
$$p_k(x) = \frac{π_k \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x- \mu_k)^2)}{\sum^K_{l=1} π_l \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x- \mu_l)^2)}$$
is largest when $x = \mu_{k}$ (i.e. an observation is classified to the kth class when $x$ is close to $\mu_{k}$). We can proceed toward the discriminant function $\delta_{k}(x) = \ln C^{-1}p_{k}(x)$ using $C$ as a scaling factor of proportionality
$$\begin{array}{rcl}
p_k(x) & = & \frac{π_k \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x- \mu_k)^2)}{\sum^K_{l=1} π_l \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x- \mu_l)^2)} \\
p_k(x) & \propto & π_k \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x- \mu_k)^2) \\
p_k(x) & \propto & π_k \frac{1}{\sqrt{2π}σ}exp(- \frac{1}{2σ^2}(x^{2}- 2\mu_{k}x + \mu_{k}^{2})) \\
p_k(x) & = & Cπ_k exp(- \frac{1}{2σ^2}(-2\mu_{k}x + \mu_{k}^{2})) \\
C^{-1}p_k(x) & = & π_k exp(- \frac{1}{2σ^2}(-2\mu_{k}x + \mu_{k}^{2})) \\
\ln C^{-1}p_{k}(x) & = & \ln \pi_{k} + \frac{\mu_{k}x}{\sigma^{2}} - \frac{\mu_{k}^{2}}{2\sigma^{2}} \\
\delta_{k}(x) & = & \frac{\mu_{k}x}{\sigma^{2}} - \frac{\mu_{k}^{2}}{2\sigma^{2}} + \ln \pi_{k}
\end{array}$$
where the observation is also classified into the $k^{th}$ class when $x$ is close to $\mu_{k}$.
3. For QDA, whose observations $X_{k} \sim N(\mu_{k}, \sigma_{k}^{2})$, consider the case with one feature (i.e. $p = 1$). Prove that the Bayes Classifier is quadratic (i.e. not linear).
In a similar proof as the previous exercise, but without the assumption of the same variance, so each class has its own variance $\sigma_{k}$,
$$p_k(x) = \frac{π_k \frac{1}{\sqrt{2π}σ_{k}}exp(- \frac{1}{2σ_{k}^2}(x- \mu_k)^2)}{\sum^K_{l=1} π_l \frac{1}{\sqrt{2π}σ_{l}}exp(- \frac{1}{2σ_{l}^2}(x- \mu_l)^2)}$$
and we would arrive at the discriminant function
$$\begin{array}{rcl}
C^{-1}p_k(x) & = & \frac{\pi_k}{\sigma_{k}} exp(- \frac{1}{2σ_{k}^2}(x^{2}-2\mu_{k}x + \mu_{k}^{2})) \\
\ln C^{-1}p_{k}(x) & = & \ln\frac{π_k}{\sigma_{k}} - \frac{x^{2}}{2\sigma_{k}^{2}} + \frac{\mu_{k}x}{\sigma_{k}^{2}} - \frac{\mu_{k}^{2}}{2\sigma_{k}^{2}} \\
\delta_{k}(x) & = & - \frac{1}{2\sigma_{k}^{2}}x^{2} + \frac{\mu_{k}}{\sigma_{k}^{2}}x - \frac{\mu_{k}^{2}}{2\sigma_{k}^{2}} + \ln \pi_{k} - \ln\sigma_{k}
\end{array}$$
which is quadratic with respect to $x$.
4. When the number of features $p$ is large, we may encounter the *curse of dimensionality*.
a) $p = 1, X \sim U(0,1)$, and we classify a test observation using the training observations within 10 percent of its value. On average, we use 10 percent of the available observations.
b) $p = 2$, we would use 1 percent of the observations.
c) $p = 100$, we would use $0.1^{p-2}$ percent of the observations.
d) KNN becomes unreliable: as p grows, essentially no training observations are close to a given test observation, so predictions are based on very few (or very distant) neighbors.
e) One idea is to extend the side length of the $p$-dimensional hypercube until it captures 10 percent of the observations, but the required side length ($0.1^{1/p}$) approaches 1 as p grows, so the neighborhood is no longer local.
5. LDA versus QDA
* If the Bayes decision boundary is linear, QDA may be better on the training set with its flexibility, but will probably be worse on the test set due to higher variance. Therefore, LDA is advised.
* If the Bayes decision boundary is non-linear, QDA is advised.
6. We model with logistic regression
* $Y$: receive an A
* $X_{1}$: hours studied, $X_{2}$: undergraduate GPA
* coefficients $\hat{\beta}_{0} = -6$, $\hat{\beta}_{1} = 0.05$, $\hat{\beta}_{2} = 1$
a) Estimate the probability that a student who studies for 40 hours and has an undergraduate GPA of 3.5 gets an A in the class.
$$Y = \frac{e^{\hat{\beta}_{0} + \hat{\beta}_{1}X_{1} + \hat{\beta}_{2}X_{2}} }{1 + e^{\hat{\beta}_{0} + \hat{\beta}_{1}X_{1} + \hat{\beta}_{2}X_{2}}} = \frac{e^{-0.5}}{1 + e^{-0.5}} \approx 0.3775$$
b) How many hours would the student in part (a) need to study to have a 50 percent chance of getting an A in the class?
$$\begin{array}{rcl}
\ln\left(\frac{Y}{1 - Y}\right) & = & \hat{\beta}_{0} + \hat{\beta}_{1}X_{1} + \hat{\beta}_{2}X_{2} \\
\ln\left(\frac{0.5}{1 - 0.5}\right) & = & -6 + 0.05X_{1} + 3.5 \\
X_{1} & = & 50 \text{ hours} \\
\end{array}$$
7. Predict $Y$ (whether or not a stock will issue a dividend: "Yes" or "No") based on $X$ (last year's percent profit).
* issued dividend: $X \sim N(10, 36)$
* no dividend: $X \sim N(0, 36)$
* P(issue dividend) = 0.80
Using Bayes' Rule
$$\begin{array}{rcl}
P(Y = \text{yes}|X) & = & \frac{\pi_{\text{yes}}\exp(-\frac{1}{2\sigma^{2}}(x-\mu_{\text{yes}})^{2})}{\sum_{l = 1}^{K} \pi_{l}\exp(-\frac{1}{2\sigma^{2}}(x-\mu_{l})^{2})} \\
P(Y = \text{yes}|X = 4) & = & \frac{0.8\exp(-0.5)}{0.8\exp(-0.5) + 0.2\exp(-16/72)} \\
P(Y = \text{yes}|X = 4) & \approx & 0.7519 \\
\end{array}$$
8. Two models
1) logistic regression: 30% training error, 20% test error
2) KNN (K = 1) averaged 18% error over training and test sets
The KNN with K=1 model would fit the training set exactly and so the training error would be zero. This means the test error has to be 36% in order for the average of the errors to be 18%. As model selection is based on performance on the test set, we will choose logistic regression to classify new observations.
9. About odds
a) If the odds of defaulting on a credit card payment are 0.37, then the probability of default is
$$0.37 = \frac{P(X)}{1 - P(X)} \quad\Rightarrow\quad P(X) = \frac{0.37}{1.37} \approx 0.2701$$
b) If an individual has 16% chance of defaulting, then their odds are
$$\text{odds} = \frac{P(X)}{1 - P(X)} = \frac{0.16}{1 - 0.16} \approx 0.1905$$
### Applied
13.
```{r}
library("ISLR")
# (a) numerical and graphical summaries
summary(Weekly)
```
```{r}
# scatterplot matrix
pairs(Weekly[,1:8])
```
```{r}
# correlation matrix
round(cor(Weekly[,1:8]),2)
```
```{r}
# (b) logistic regression
logistic_fit = glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly, family=binomial)
summary(logistic_fit)
```
```{r}
# (c) confusion matrix
logistic_probs = predict(logistic_fit, type="response")
logistic_preds = rep("Down", 1089) # Vector of 1089 "Down" elements.
logistic_preds[logistic_probs>0.5] = "Up" # Change "Down" to up when probability > 0.5.
attach(Weekly)
table(logistic_preds,Direction)
```
$$\text{accuracy} = \frac{54 + 557}{54 + 48 + 430 + 557} \approx 0.5611$$
```{r}
# Training observations from 1990 to 2008.
train = (Year<2009)
# Test observations from 2009 to 2010.
Test = Weekly[!train ,]
Test_Direction= Direction[!train]
# Logistic regression on training set.
logistic_fit2 = glm(Direction ~ Lag2, data=Weekly, family=binomial, subset=train)
# Predictions on the test set.
logistic_probs2 = predict(logistic_fit2,Test, type="response")
logistic_preds2 = rep("Down", 104)
logistic_preds2[logistic_probs2>0.5] = "Up"
# Confusion matrix.
table(logistic_preds2,Test_Direction)
```
$$\text{accuracy} = \frac{9 + 56}{9 + 5 + 34 + 56} = 0.625$$
```{r}
# Using LDA
library("MASS")
lda_fit = lda(Direction ~ Lag2, data=Weekly, subset=train)
#lda_fit
# Predictions on the test set.
lda_pred = predict(lda_fit,Test)
lda_class = lda_pred$class
# Confusion matrix.
table(lda_class,Test_Direction)
```
$$\text{accuracy} = \frac{9 + 56}{9 + 5 + 34 + 56} = 0.625$$
```{r}
# Using QDA.
qda_fit = qda(Direction ~ Lag2, data=Weekly, subset=train)
qda_pred = predict(qda_fit,Test)
qda_class = qda_pred$class
table(qda_class,Test_Direction)
```
```{r}
# Using KNN
library("class")
set.seed(1)
train_X = Weekly[train,3]
test_X = Weekly[!train,3]
train_direction = Direction[train]
# Changing from vector to matrix by adding dimensions
dim(train_X) = c(985,1)
dim(test_X) = c(104,1)
# Predictions for K=1
knn_pred = knn(train_X, test_X, train_direction, k=1)
table(knn_pred, Test_Direction)
```
$$\text{accuracy} = \frac{21 + 31}{21 + 30 + 22 + 31} = 0.5$$
14. Develop a model to predict whether a given car gets high or low gas mileage based on the `Auto` data set.
```{r}
# binary variable for "high" versus "low"
# Dataframe with "Auto" data and empty "mpg01" column
df = Auto
df$mpg01 = NA
median_mpg = median(df$mpg)
# Loop
for(i in 1:dim(df)[1]){
if (df$mpg[i] > median_mpg){
df$mpg01[i] = 1
}else{
df$mpg01[i] = 0
}
}
```
```{r}
# graphical summary
pairs(df[,c(1:8,10)])
```
```{r}
# correlation matrix
round(cor(df[,c(1:8,10)]),2)
```
```{r}
library('tidyverse')
# split into training and test set
set.seed(123)
df <- df |>
mutate(splitter = sample(c("train", "test"), nrow(df), replace = TRUE))
train2 <- df |> filter(splitter == "train")
test2 <- df |> filter(splitter == "test")
```
```{r}
# LDA model
lda_fit3 = lda(mpg01 ~ cylinders+displacement+horsepower+weight, data=train2)
# Predictions and confusion matrix
lda_pred3 = predict(lda_fit3,test2)
predictions = lda_pred3$class
actual = test2$mpg01
table(predictions,actual)