cm012.Rmd

# Working with factors in R

Today's class is working with a very important component of R - factors.

## Worksheet

Link to cm012 [worksheet file](https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/tutorials/cm012-exercise.Rmd).

## Resources

### References and tutorials

* Jenny Bryan's notes on [factors](https://stat545.com/factors-boss.html)

### Package documentation

* [forcats](https://forcats.tidyverse.org) package

<!---The following chunk allows errors when knitting--->

```{r allow errors2, echo = FALSE}
knitr::opts_chunk$set(error = TRUE)
```

Load the required libraries. You might need to install library `forcats` first (install.packages("forcats"))

```{r, echo = TRUE}
library(gapminder)
library(tidyverse)
library(dplyr)
library(forcats)
library(ggplot2)
```

## Recap of CM011

Outline of last lecture lecture

* here package
* read/write_csv (and friends)
* read_excel() function from readxl package
* data processing and importing

## Motivating the need for factors in R

### Activity 1: Using Factors for plotting 

**1.1** Let's look again into `gapminder` dataset and create a new cloumn, `life_level`, that contains five categories ("very high", "high","moderate", "low" and "very low") based on life expectancy in 1997. Assign categories accoring to the table below:

| Criteria | life_level| 
|-------------|-----------|
| less than 23 | very low |
| between 23 and 48 | low |
| between 48 and 59 | moderate |
| between 59 and 70 | high |
| more than 70 | very high |

Function `case_when()` is a tidier way to vectorise multiple `if_else()` statements. you can read more about this function [here](https://dplyr.tidyverse.org/reference/case_when.html).

```{r}
gapminder %>% 
  filter(year == 1997) %>% 
  mutate(life_level = case_when(lifeExp < 23 ~ 'very low',
                                lifeExp < 48~ 'low',
                                lifeExp < 59 ~ 'moderate',
                                lifeExp < 70 ~ 'high',
                                TRUE ~ 'very high')) %>% 
  ggplot() + geom_boxplot(aes(x = life_level, y = gdpPercap)) +
  labs(y = "GDP per capita, $", x= "Life expectancy level, years") +
  theme_bw() 
```

Do you notice anything odd/wrong about the graph?

We can make a few observations:

- It seems that none of the countries had a "very low" life-expectancy in 1997. 

- However, since it was an option in our analysis it should be included in our plot. Right?

- Notice also how levels on x-axis are placed in the "wrong" order.

**1.2** You can correct these issues by explicitly setting the levels parameter in the call to `factor()`. Use, `drop = FALSE` to tell the plot not to drop unused levels 
```{r}
gapminder %>% 
  filter(year == 1997) %>% 
  mutate(life_level = factor(case_when(lifeExp < 23 ~ 'very low',
                                lifeExp < 48~ 'low',
                                lifeExp < 59 ~ 'moderate',
                                lifeExp < 70 ~ 'high',
                                TRUE ~ 'very high'),
                levels = c("very low" , "low","moderate", "high","very high"))) %>%  
  ggplot() + geom_boxplot(aes(x = life_level, y = gdpPercap)) +
  labs(y = "GDP per capita, $", x= "Life expectancy level, years") +
  theme_bw() +
  scale_x_discrete(drop = FALSE)
```

## Inspecting factors (activity 2)

In Activity 1, we created our own factors, so now let's explore what categorical variables that we have in the `gapminder` dataset.

### Exploring `gapminder$continent` (activity 2.1)

Use functions such as `str()`, `levels()`, `nlevels()` and `class()` to answer the following questions:

- what class is  `continent`(a factor or charecter)?
- How many levels? What are they?
- What integer is used to represent factor "Asia"?

```{r}
class(gapminder$continent)
levels(gapminder$continent)
nlevels(gapminder$continent)
str(gapminder$continent)
gapminder
```

### Exploring `gapminder$country` (activity 2.2)

Let's explore what else we can do with factors:

Answer the following questions: 

- How many levels are there in `country`?
- Filter `gapminder` dataset by 5 countries of your choice. How many levels are in your filtered dataset?

```{r}
nlevels(gapminder$country)

h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")

h_gap <- gapminder %>%
  filter(country %in% h_countries)

nlevels(h_gap$country)
```

## Dropping unused levels

What if we want to get rid of some levels that are "unused" - how do we do that? 

The function `droplevels()` operates on all the factors in a data frame or on a single factor. The function `forcats::fct_drop()` operates on a factor.

```{r}
h_gap_dropped <- h_gap %>% 
  droplevels()

h_gap_dropped$country  %>%
  nlevels()
```

## Changing the order of levels

Let's say we wanted to re-order the levels of a factor using a new metric - say, count().

We should first produce a frequency table as a tibble using `dplyr::count()`:

```{r}
gapminder %>% 
  count(continent)
```

The table is nice, but it would be better to visualize the data.
Factors are most useful/helpful when plotting data.
So let's first plot this:

```{r}
gapminder %>%
  ggplot() +
  geom_bar(aes(continent)) +
  coord_flip()+
  theme_bw() +
  ylab("Number of entries") + xlab("Continent")
```

Think about how levels are normally ordered. 
It turns out that by default, R always sorts levels in alphabetical order. 
However, it is preferable to order the levels according to some principle:

  1. Frequency/count. 
  
- Make the most common level the first and so on. Function `fct_infreq()` might be useful.
- The function `fct_rev()` will sort them in the opposite order.

For instance ,
    `  
```{r}
gapminder %>%
  ggplot() +
  geom_bar(aes(fct_infreq(continent))) +
  coord_flip()+
  theme_bw() +
  ylab("Number of entries") + xlab("Continent")
```

Section 9.6 of Jenny Bryan's [notes](https://stat545.com/factors-boss.html#reorder-factors) has some helpful examples.

  2. Another variable. 
  
  - For example, if we wanted to bring back our example of ordering `gapminder` countries by life expectancy, we can visualize the results using `fct_reorder()`. 

```{r}
##  default summarizing function is median()
gapminder %>%
  ggplot() +
  geom_bar(aes(fct_reorder(continent, lifeExp, max))) +
  coord_flip()+
  theme_bw() +
  xlab("Continent")+ylab("Number of entries") 
```

Use `fct_reorder2()` when you have a line chart of a quantitative x against another quantitative y and your factor provides the color. 

```{r}
## order by life expectancy 
ggplot(h_gap, aes(x = year, y = lifeExp,
                  color = fct_reorder2(country, year, lifeExp))) +
  geom_line() +
  labs(color = "country")
```

## Change order of the levels manually

This might be useful if you are preparing a report for say, the state of affairs in Africa.

```{r}
gapminder %>%
  ggplot() +
  geom_bar(aes(fct_relevel(continent,"Oceania"))) +
  coord_flip()+
  theme_bw() 
```
More details on reordering factor levels by hand can be found [here] https://forcats.tidyverse.org/reference/fct_relevel.html

### Recoding factors

Sometimes you want to specify what the levels of a factor should be.
For instance, if you had levels called "blk" and "brwn", you would rather they be called "Black" and "Brown" - this is called recoding.
Lets recode `Oceania` and the `Americas` in the graph above as abbreviations `OCN` and `AME` respectively using the function `fct_recode()`.

```{r}
gapminder %>%
  ggplot() +
  geom_bar(aes(fct_recode(continent,"OCN"="Oceania" , "AME" = "Americas"))) +
  coord_flip()+
  theme_bw()
```

## Grow a factor (OPTIONAL)

Let’s create two data frames,`df1` and `df2` each with data from two countries, dropping unused factor levels.
```{r}
df1 <- gapminder %>%
  filter(country %in% c("United States", "Mexico"), year > 2000) %>%
  droplevels()
df2 <- gapminder %>%
  filter(country %in% c("France", "Germany"), year > 2000) %>%
  droplevels()
```

The country factors in df1 and df2 have different levels.
Can we just combine them?
```{r}
c(df1$country, df2$country)
```

The country factors in `df1` and `df2` have different levels.
Can you just combine them using `c()`?
```{r}
fct_c(df1$country, df2$country)
```

Explore how different forms of row binding work behave here, in terms of the country variable in the result.
```{r}
bind_rows(df1, df2)
rbind(df1, df2)
```