---
title: "GLM practice"
---
# Goals
- Gain experience fitting and interpreting a binomial GLM with interactions
- Gain familiarity with interpreting centered and scaled (standardized) predictors, including centered binary predictors
# The Titanic data set
The following data set is from <https://www.kaggle.com/c/titanic/data>. It represents the passengers on the Titanic, whether they survived or not, and a number of characteristics about them. We'll be working with columns that represent passenger age, passenger gender (female = 1, male = 0), and the fare price they paid for their ticket. We'll be using these characteristics to predict whether a passenger survived (yes = 1, no = 0).
You can read in the data and clean it up a bit with the following code:
```{r, message=FALSE}
library(tidyverse)
d <- read_csv("data/raw/titanic.csv")
d <- mutate(d, female = ifelse(Sex == "female", 1, 0))
names(d) <- tolower(names(d))
d <- select(d, survived, age, fare, female) %>% na.omit() %>% as_tibble()
d
# View(d)
```
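As a quick sanity check (a sketch), you can confirm the 0/1 codings before plotting:
```{r}
# Sketch: tally the recoded `female` variable against `survived` (both coded 0/1).
count(d, female, survived)
```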
# Explore the data set
Start by taking a few minutes to explore the data graphically. What patterns do you see and what might you expect when you model these data?
```{r}
ggplot(d, aes(age, survived, colour = as.factor(female), size = fare)) + # exercise
geom_point(position = position_jitter(height = 0.2)) # exercise
ggplot(d, aes(fare, survived, colour = as.factor(female), size = age)) + # exercise
geom_point(position = position_jitter(height = 0.2)) # exercise
ggplot(d, aes(age, survived, colour = log(fare))) + # exercise
geom_point(position = position_jitter(height = 0.2)) + # exercise
facet_wrap(~female) # exercise
```
# A simple model to start with
Start with a simple model with the 3 predictors and no interactions. We are working with binary data as a response. What distribution and link would make sense?
```{r}
m <- glm(survived ~
age + fare + female, data = d, family = binomial(link = "logit")) # exercise
arm::display(m)
plot(ggeffects::ggpredict(m)) %>%
cowplot::plot_grid(plotlist = .)
sjPlot::plot_model(m, type = "est")
```
Note that `sjPlot::plot_model()` is showing us *odds ratios*, not *log* odds ratios as we see with `summary()` or `arm::display()`.
In this initial model, how much greater are the odds of surviving for female passengers versus male passengers?
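One way to answer this (a sketch): exponentiate the `female` coefficient from `m` to convert the log odds ratio into an odds ratio.
```{r}
# Sketch: odds ratio of survival for female vs. male passengers,
# holding age and fare constant.
exp(coef(m)[["female"]])
```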
Now try adding all 2-way interactions. Remember that there is a shortcut for this. Try comparing this model to the model without any interactions via AIC.
```{r}
m2 <- glm(survived ~
(age + fare + female) ^ 2, data = d, family = binomial(link = "logit")) # exercise
arm::display(m2)
bbmle::AICtab(m, m2) # exercise
sjPlot::plot_model(m2)
```
In the above models, most of the coefficients seem really small. Why is that?
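One way to see why (a sketch): the raw predictors span wide ranges, so a coefficient per single unit (one year of age or one dollar of fare) is necessarily small.
```{r}
# Sketch: spread of the raw predictors. A 1-unit change (1 year or $1)
# is a small fraction of each variable's range, so per-unit coefficients look small.
sapply(d[c("age", "fare")], sd)
sapply(d[c("age", "fare")], range)
```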
# A standardized model
Now we're going to refit the above model with versions of the predictors that have been centered (by subtracting their mean value) and scaled (by dividing by 2 standard deviations). The binary predictor (`female`) will be centered on its mean value but not scaled. This standardizing procedure makes the magnitudes of the coefficients approximately comparable: each scaled coefficient then represents the effect of a 2 SD change in the raw predictor.
```{r}
d$age_scaled <- arm::rescale(d$age)
d$fare_scaled <- arm::rescale(d$fare)
d$female_centered <- arm::rescale(d$female)
# or
# d$female_centered <- d$female - mean(d$female) # same thing
# or:
# m3 <- arm::standardize(m2)
# but we will use arm::rescale so it is clear what we are doing
```
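As a quick check (a sketch), `arm::rescale()` on a numeric predictor should match subtracting the mean and dividing by 2 standard deviations, as described above:
```{r}
# Sketch: manual equivalent of arm::rescale() for a numeric predictor.
all.equal(d$age_scaled, (d$age - mean(d$age)) / (2 * sd(d$age)))
```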
Now refit the model with all 2-way interactions but with the scaled versions of age and fare: `age_scaled` and `fare_scaled`. This time use the 0-1 version of `female`:
```{r}
m3 <- glm(survived ~
(age_scaled + fare_scaled + female)^2, data = d, family = binomial()) # exercise
arm::display(m3)
sjPlot::plot_model(m3)
```
Now fit the same model but with `female_centered`:
```{r}
m4 <- glm(survived ~
(age_scaled + fare_scaled + female_centered)^2, data = d, family = binomial()) # exercise
arm::display(m4)
sjPlot::plot_model(m4)
```
# Interpreting the standardized models
If a man on the Titanic had paid \$250 for his ticket, what are his odds of surviving compared to another man who paid \$150? (Note that a one-unit change in a standardized predictor represents 2 standard deviations of the raw variable, and 2 standard deviations of `fare` is about \$100.) I.e.:
```{r}
round(sd(d$fare) * 2, 1)
```
Use `m3`, the model with `female` instead of `female_centered`:
```{r}
exp(coef(m3)["fare_scaled"]) # exercise
```
And what would be the odds ratio for 2 women's chances of survival if one had paid an extra \$100 (again, a 1-unit change in the scaled variable)?
```{r}
exp(coef(m3)[["fare_scaled"]] + coef(m3)[["fare_scaled:female"]] * 1) # exercise
```
And now for the model with the centered predictor for gender. Take a look at the data frame again. What are the 2 unique values in the `female_centered` column? Remember that we subtracted the mean value of that column.
```{r}
unique(d$female_centered) %>% round(2)
unique(d$female)
mean(d$female) %>% round(2)
```
Why is the effect of `fare_scaled` different between the 2 models? What do these 2 coefficients represent?
```{r}
exp(coef(m3)[["fare_scaled"]]) # uncentered model
exp(coef(m4)[["fare_scaled"]]) # centered model
```
In `m3`, the male/female variable is 0/1, so `exp(coef(m3)[["fare_scaled"]])` represents the multiplicative effect of a 2 SD (~\$100) fare increase on the odds of survival when `female = 0`, i.e., for males. <!-- exercise -->
In `m4`, the male/female variable is -0.37/0.63, so `exp(coef(m4)[["fare_scaled"]])` represents the multiplicative effect of a 2 SD (~\$100) fare increase on the odds of survival for an average individual (male or female) across the whole dataset. <!-- exercise -->
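As a check (a sketch), the `fare_scaled` main effect in `m4` should equal the `m3` effect evaluated at the mean of `female` (about 0.37), because centering is just a linear reparameterization:
```{r}
# Sketch: m4's fare_scaled main effect equals m3's fare_scaled effect
# evaluated at the mean of `female`.
coef(m3)[["fare_scaled"]] + coef(m3)[["fare_scaled:female"]] * mean(d$female)
coef(m4)[["fare_scaled"]]
```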
```{r}
sjPlot::plot_model(m3) # uncentered model
sjPlot::plot_model(m4) # centered model
```
How would you derive the same odds ratios as above? (I.e., how can we determine the effect of paying approximately an extra \$100 for a ticket for a man and for a woman using the coefficients from the centered model `m4`?)
Remember that these models make the same predictions; they are just parameterized slightly differently.
```{r}
exp(coef(m4)[["fare_scaled"]] + coef(m4)[["fare_scaled:female_centered"]] * 0.63) # exercise
exp(coef(m4)[["fare_scaled"]] + coef(m4)[["fare_scaled:female_centered"]] * -0.37) # exercise
```
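To confirm that `m3` and `m4` really are the same model (a sketch), compare their fitted values:
```{r}
# Sketch: m3 and m4 are reparameterizations of the same model, so their
# fitted probabilities should match (up to numerical tolerance).
all.equal(predict(m3, type = "response"), predict(m4, type = "response"))
```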
What are the advantages and disadvantages of the approaches taken for `m3` and `m4`? (I.e., centering or not centering the binary predictor `female`.)
- In the centered model, the main effect of fare is immediately apparent. <!-- exercise -->
- In the centered model, the intercept is the log odds of survival for an average passenger (at the mean of all predictors), which may be of interest. <!-- exercise -->
- In the uncentered model, the effect of fare for males and the effect of fare for females are easier to derive, but it takes more work to figure out the average effect. <!-- exercise -->
- In the uncentered model, the intercept is the log odds of survival for males, and the equivalent for females can be calculated more easily. <!-- exercise -->
- You can always fit the model both ways (or calculate the equivalent effects, as above) after the fact if you'd like. <!-- exercise -->
# Bonus
If you made it this far, read in the original data frame again and try plotting and modeling some of the other predictors. Are there other interesting patterns to be found? Can you find a model with a lower AIC? If you need more documentation on the data, have a look at <https://www.kaggle.com/c/titanic/data>.
```{r}
d <- read_csv("data/raw/titanic.csv")
d <- mutate(d, female = ifelse(Sex == "female", 1, 0))
names(d) <- tolower(names(d))
# View(d)
```
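For example, one possible starting point (a sketch; `pclass` is assumed to come from the Kaggle `Pclass` column after the `tolower()` step above, and the object names here are just placeholders):
```{r}
# Sketch: add passenger class and compare AIC on a common set of rows.
d2 <- select(d, survived, age, fare, female, pclass) %>% na.omit()
m_base <- glm(survived ~ age + fare + female,
  data = d2, family = binomial())
m_class <- glm(survived ~ age + fare + female + as.factor(pclass),
  data = d2, family = binomial())
bbmle::AICtab(m_base, m_class)
```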
Your turn:
```{r}
```