---
title: "Data Carpentry Workshop Remix Jan 15, 2020"
author: "Rachel Schwartz"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Setup
Instructors talk about the [Data Carpentry Intro](https://datacarpentry.org/R-ecology-lesson/00-before-we-start.html) before continuing the lesson below.
Note that this lesson is adapted from the [Data Carpentry Ecology Lesson](https://datacarpentry.org/R-ecology-lesson/index.html). You can find a lot more information there. Text that has been copied from this lesson is highlighted in blue.
```{r pack, warning=FALSE}
#install.packages("tidyverse")
library(tidyverse)
```
## Get the Data
The data used for this lesson are all about Australia, including its climate over time and recent fires. To learn more about Australia's devastating fires check out this NY Times [article](https://www.nytimes.com/interactive/2020/01/02/climate/australia-fires-map.html). The data and information come from Tidy Tuesday, a weekly data project aimed at the R ecosystem. You can find more datasets at https://github.com/rfordatascience/tidytuesday .
The data are available online so the first thing we'll do is load them into R.
To load data you'll need to use a *function*.
<p style="color:blue">
Functions are "canned scripts" that automate more complicated sets of commands
including operations, assignments, etc. Many functions are predefined, or can be
made available by importing R *packages*. A function
usually takes one or more inputs called *arguments*. Functions often (but not
always) return a *value*. A typical example would be the function `sqrt()`. The
input (the argument) must be a number, and the return value (in fact, the
output) is the square root of that number. Executing a function ('running it')
is called *calling* the function.
</p>
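For example, calling `sqrt()` with the argument 9:
```{r sqrt-example}
# Calling sqrt() with the argument 9 returns its square root
sqrt(9)
```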
<p style="color:blue">
Packages in R are basically sets of additional functions that let you do more
stuff. The functions we've been using so far, like `str()` or `data.frame()`,
come built into R; packages give you access to more of them. Before you use a
package for the first time you need to install it on your machine, and then you
should import it in every subsequent R session when you need it. You should
already have installed the **`tidyverse`** package. This is an
"umbrella-package" that installs several packages useful for data analysis which
work together well such as **`tidyr`**, **`dplyr`**, **`ggplot2`**, **`tibble`**, etc.
</p>
We'll read in our data using the `read_csv()` function, from the tidyverse package **`readr`**.
We assign each dataset to a variable so we can reuse it.
The name of the variable is on the left side of the arrow.
The action we're doing with the function is on the right side.
```{r data}
rainfall <- read_csv('https://tinyurl.com/Oz-fire-rain')
temperature <- read_csv('https://tinyurl.com/Oz-fire-temp')
```
You can see some information about the data we have just loaded.
The name of each column is shown along with the type of data in that column.
The data are stored in a format we call a data frame.
<p style="color:blue">
Data frames are the _de facto_ data structure for most tabular data, and what we
use for statistics and plotting. A data frame can be created by hand, but most commonly they are generated by functions such as `read_csv()`; in other words, when importing
spreadsheets from your hard drive (or the web). A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors).
</p>
<p style="color:blue">
You will see the message `Parsed with column specification`, followed by each column name and its data type. When you execute `read_csv` on a data file, it looks through the first 1000 rows of each column and guesses the data type for each column as it reads it into R. For example, in this dataset, `read_csv` reads columns as `col_double` (a numeric data type), and as `col_character`. You have the option to specify the data type for a column manually by using the `col_types` argument in `read_csv`.
</p>
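As a quick sketch of the `col_types` argument (the column names here are assumed from the rainfall file's printed spec; adjust them to match what you see when you load the data):
```{r col-types-example}
# Force two columns to specific types and let readr guess the rest
rainfall_typed <- read_csv('https://tinyurl.com/Oz-fire-rain',
                           col_types = cols(.default = col_guess(),
                                            city_name = col_character(),
                                            rainfall  = col_double()))
```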
<p style="color:blue">
If you want to inspect your data frame there are several functions to do so. `head()` and `str()` can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.
* Size:
    * `dim(surveys)` - returns a vector with the number of rows in the first element,
      and the number of columns as the second element (the **dim**ensions of the object)
    * `nrow(surveys)` - returns the number of rows
    * `ncol(surveys)` - returns the number of columns
* Content:
    * `head(surveys)` - shows the first 6 rows
    * `tail(surveys)` - shows the last 6 rows
* Names:
    * `names(surveys)` - returns the column names (synonym of `colnames()` for `data.frame` objects)
    * `rownames(surveys)` - returns the row names
* Summary:
    * `str(surveys)` - structure of the object and information about the class, length and
      content of each column
    * `summary(surveys)` - summary statistics for each column
</p>
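The list above uses the `surveys` data frame from the original Data Carpentry lesson; with the data frames we just loaded, the same calls look like this:
```{r inspect-example}
dim(rainfall)      # number of rows and columns
nrow(temperature)  # number of rows
head(rainfall)     # first 6 rows
str(temperature)   # structure: column names, types, and a preview of the values
```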
For more details on this dataset see the [Tidy Tuesday site](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-07).
<p style="color:blue">
This statement doesn’t produce any output because assignments don’t display anything.
If we want to check that our data has been loaded, we can see the contents of the data frame by typing its name: `rainfall`.
</p>
```{r d2}
rainfall
```
You can also view the data in a separate window by clicking on its name in the Global Environment window.
Here you can also see some information about the data. Note how much data we have!
Click the arrow next to `rainfall` to view the columns and type of data in each column.
For more information about this dataset (i.e.\ the metadata) see [https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-07/readme.md](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-07/readme.md)
Note that we've read our data from a website, but you can read local files as well.
These files are in csv format, which means plain text where columns are separated by commas.
This is a very simple format that avoids all the complexity of Excel.
If you need to read an Excel file you can either export as a csv or use a different function to read the data.
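For example, here is a sketch of reading local files (the file paths are hypothetical; `read_excel()` comes from the **`readxl`** package, which is installed with the tidyverse but not loaded by `library(tidyverse)`):
```{r local-files-example, eval=FALSE}
# A local csv file is read the same way as one on the web
my_local_data <- read_csv("data/my_local_file.csv")

# Excel files need readxl; load it separately before using read_excel()
library(readxl)
my_excel_data <- read_excel("data/my_spreadsheet.xlsx", sheet = 1)
```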
For more information on data frames see the [Starting with Data](https://datacarpentry.org/R-ecology-lesson/02-starting-with-data.html#what_are_data_frames) section of the Data Carpentry lesson.
## Basic data exploration
One handy way to look at your data is to get a summary table.
Let's do this for the temperature dataset.
```{r}
summary(temperature)
```
For a really basic exploratory analysis let's look at how temperature is changing in Australian cities over time.
I've summarized our original dataset so you can make your first plot more easily.
First: load the summary data, which can be found at **https://tinyurl.com/Oz-mean-temp**.
```{r t}
#Challenge: load data
yearly_temp <- read_csv('https://tinyurl.com/Oz-mean-temp')
```
Now we'll plot the temperature as a function of time.
<p style="color:blue">
### Plotting with **`ggplot2`**
**`ggplot2`** is a plotting package that makes it simple to create complex plots
from data in a data frame. It provides a more programmatic interface for
specifying what variables to plot, how they are displayed, and general visual
properties. Therefore, we only need minimal changes if the underlying data change
or if we decide to change from a bar plot to a scatter plot. This helps in creating
publication quality plots with minimal amounts of adjustments and tweaking.<br><br>
**`ggplot2`** functions like data in the 'long' format, i.e., a column for every dimension,
and a row for every observation. Well-structured data will save you lots of time
when making figures with **`ggplot2`**<br><br>
ggplot graphics are built step by step by adding new elements. Adding layers in
this fashion allows for extensive flexibility and customization of plots.<br><br>
To build a ggplot, we will use the following basic template that can be used for different types of plots:
```
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
```
- use the `ggplot()` function and bind the plot to a specific data frame using the
`data` argument
```{r}
ggplot(data = yearly_temp)
```
- define a mapping (using the aesthetic (`aes`) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.
</p>
```{r t2a}
ggplot(yearly_temp, aes(year,temperature))
```
<span style="color: blue">
- add 'geoms' – graphical representations of the data in the plot (points,
lines, bars). **`ggplot2`** offers many different geoms; we will use some
common ones today, including:
* `geom_point()` for scatter plots, dot plots, etc.
* `geom_boxplot()` for, well, boxplots!
* `geom_line()` for trend lines, time series, etc.
</span>
<p style="color:blue">
To add a geom to the plot use the `+` operator. Because we have two continuous variables,
let's use `geom_point()` first:
</p>
```{r t2}
ggplot(yearly_temp, aes(year,temperature)) +
geom_point()
```
Notice that you have a lot of data points for each year.
This is because there is a separate mean-temperature value for each city in every year.
Let's change our plot to show each city in a different color.
```{r t3}
ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
geom_point()
```
That's a bit clearer.
Let's show the pattern over time by adding lines between the yearly points.
```{r t4}
ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
geom_point()+geom_line()
```
And let's tidy this into a publication-quality plot.
```{r t5}
ggplot(yearly_temp, aes(year,temperature, color = city_name)) +
geom_line() +
labs(x = "Year", y = "Mean Temperature (Celsius)", color = "") +
theme_bw()
```
In a few quick commands we can already plot temperature and observe how it's been increasing.
**Notes**
<span style="color: blue">
- Anything you put in the `ggplot()` function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in `aes()`.<br>
- You can also specify mappings for a given geom independently of the mappings defined globally in the `ggplot()` function.<br>
- The `+` sign used to add new layers must be placed at the end of the line containing the *previous* layer. If, instead, the `+` sign is added at the beginning of the line containing the new layer, **`ggplot2`** will not add the new layer and will return an error message.
</span>
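To make the second note concrete, here is a small sketch where the color mapping is set inside one geom rather than globally, so it applies only to that layer:
```{r geom-aes-example}
# color is mapped only for the points; the lines get a fixed grey color instead
ggplot(yearly_temp, aes(year, temperature)) +
  geom_line(aes(group = city_name), color = "grey70") +
  geom_point(aes(color = city_name))
```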
## Manipulating data
In the prior section I gave you a summary table of temperature data.
Let's consider how you could generate this summary table and do other data manipulation
given our original datasets.
<p style="color:blue">
### Data Manipulation using **`dplyr`** and **`tidyr`**
The **`tidyverse`** package tries to address 3 common issues that arise when
doing data analysis with some of the functions that come with R:
1. The results from a base R function sometimes depend on the type of data.
2. Using R expressions in a non-standard way, which can be confusing for new
learners.
3. Hidden arguments, having default operations that new learners are not aware
of.
The package **`dplyr`** provides easy tools for the most common data manipulation
tasks. It is built to work directly with data frames, with many common tasks
optimized by being written in a compiled language (C++). An additional feature is the
ability to work directly with data stored in an external database. The benefits of
doing this are that the data can be managed natively in a relational database,
queries can be conducted on that database, and only the results of the query are
returned.
This addresses a common problem with R in that all operations are conducted
in-memory and thus the amount of data you can work with is limited by available
memory. The database connections essentially remove that limitation in that you
can connect to a database of many hundreds of GB, conduct queries on it directly, and pull
back into R only what you need for analysis.
The package **`tidyr`** addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups - like plots or aquaria. Moving back and forth between these formats is non-trivial, and **`tidyr`** gives you tools for this and more sophisticated data manipulation.
To learn more about **`dplyr`** and **`tidyr`** after the workshop, you may want to check out this
[handy data transformation with **`dplyr`** cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) and this [one about **`tidyr`**](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf).
We're going to learn some of the most common **`dplyr`** functions:
- `select()`: subset columns
- `filter()`: subset rows on conditions
- `mutate()`: create new columns by using information from other columns
- `group_by()` and `summarize()`: create summary statistics on grouped data
- `arrange()`: sort results
- `count()`: count discrete values
### Selecting columns and filtering rows
To choose rows based on a specific criterion, use `filter()`.
</p>
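The heading also mentions selecting columns; as a quick sketch (using the column names shown when `temperature` was loaded), `select()` keeps only the columns you name:
```{r select-example}
# Keep just a few columns of the temperature data
temperature_subset <- select(temperature, city_name, date, temperature, temp_type)
head(temperature_subset)
```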
For our data let's speculate that we may be seeing an increase in fires due to
changes in maximum daily temperatures.
After all, it could be spiking temperatures that allow fires to occur,
and that wouldn't be reflected well in the mean temperature for a day.
In this case we need to filter our data to keep only the rows where the column
`temp_type` is `"max"`.
```{r f}
temperatures_maxs <- filter(temperature,temp_type == "max")
```
Now consider what summary information you want.
You probably want the average maximum temperature for each city for each year
to look at changes over time.
Let's look at how we would calculate the average maximum temperature for one city for one year.
Then we'll be able to extend this to all cities and all years.
First filter your data to include only data from 2019 at "PERTH AIRPORT".
This is a challenge for you.
```{r f1}
temperatures_maxs_PerthAir <- filter(temperatures_maxs,site_name == "PERTH AIRPORT")
```
You should have gotten stuck on how we know whether data came from 2019.
That information is in the date column but you have to extract it.
This is a great lesson.
You first need to think about what you want your data to look like.
Once you know what you want you can figure out how to communicate that to the computer.
In this case we'll use the `lubridate` package to process date information.
Challenge: load the lubridate library.
```{r pack2, warning=FALSE}
library(lubridate)
```
You can use the `year` function to extract the year from the date column.
Assign that to a new column in your data frame.
```{r d}
temperatures_maxs_PerthAir$year <- year(temperatures_maxs_PerthAir$date)
```
Now you can filter for data from 2019.
This is a challenge.
```{r f2}
temperatures_maxs_PerthAir_2019 <- filter(temperatures_maxs_PerthAir,year == 2019)
```
Now you can calculate the mean max temperature for this site in this year using the mean
function. This is a challenge.
```{r m}
mean(temperatures_maxs_PerthAir_2019$temperature)
```
<p style="color:blue">
### Pipes
What if you want to do multiple filter steps at the same time? There are three
ways to do this: use intermediate steps, nested functions, or pipes.
You probably took the intermediate step approach in your prior work.
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
```{r}
temperatures_maxs$year <- year(temperatures_maxs$date)
temperatures_maxs_PerthAir_2019 <- filter(filter(temperatures_maxs,site_name == "PERTH AIRPORT"),year == 2019)
```
This is handy, but can be difficult to read if too many functions are nested, as
R evaluates the expression from the inside out (in this case, filtering by site, then filtering by year).
The last option, *pipes*, lets you take
the output of one function and send it directly to the next, which is useful
when you need to do many things to the same dataset. Pipes in R look like
`%>%` and are made available via the **`magrittr`** package, installed automatically
with **`dplyr`**.
</p>
```{r}
temperatures_maxs_PerthAir_2019 <- temperatures_maxs %>%
filter(site_name == "PERTH AIRPORT") %>%
filter(year == 2019)
```
Note that the data are passed from one function to the next, so you do not repeat
the name of the data frame inside the later functions.
When a pipeline spans multiple lines, make sure the pipe sits at the end of each line
so R knows the command continues on the next one.
<p style="color:blue">
Some may find it helpful to read the pipe like the word "then". For instance,
in the above example, we took the data frame `temperatures_maxs`, *then* we `filter`ed
for rows with `site_name == "PERTH AIRPORT"`, *then* we `filter`ed
for rows with `year == 2019`.
Make sure to use the double equals sign (`==`) when checking for equality.
A single equals sign is used for assignment and for naming function arguments, so it will not perform the comparison you intend.
</p>
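A small sketch of the difference (the incorrect version is left commented out because dplyr stops with an error):
```{r equality-example}
# == compares: keeps only the rows where year equals 2019
filter(temperatures_maxs, year == 2019) %>% head()
# = would be treated as a named argument, and dplyr stops with an error
# suggesting you probably meant ==
# filter(temperatures_maxs, year = 2019)
```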
### Split-apply-combine data analysis and the `summarize()` function
Now we know how to calculate one example, but we would like the mean max temperature
for all cities for all years.
To get this information we'll generate a summary table.
<p style="color:blue">
Many data analysis tasks can be approached using the *split-apply-combine*
paradigm: split the data into groups, apply some analysis to each group, and
then combine the results. **`dplyr`** makes this very easy through the use of the
`group_by()` function.
#### The `summarize()` function
`group_by()` is often used together with `summarize()`, which collapses each
group into a single-row summary of that group. `group_by()` takes as arguments
the column names that contain the **categorical** variables for which you want
to calculate the summary statistics.
</p>
In this case we are interested in one result for each city for each year.
That means we'll group by these variables.
Then we'll generate a summary table.
In this table we want to calculate the mean of the temperature for each group.
```{r}
mean_max_temp <- temperatures_maxs %>%
filter(!is.na(temperature)) %>%
group_by(city_name,year) %>%
summarize(mean_max_temp = mean(temperature))
```
Now you can plot. This is a challenge.
Don't forget to fix your axis labels.
```{r}
ggplot(mean_max_temp, aes(year, mean_max_temp,color=city_name)) +
geom_line()+
labs(x = "Year", y = "Mean Temperature (Celsius)", color = "") +
theme_bw()
```
Note that we have lots of data but it's a bit hard to compare in the table.
That's because when we work with data in R we typically use "long form".
That means for every combination of variables (year and city),
we have one observation (mean temp).
Humans typically like to see data in "wide form".
We could have years as rows, cities as columns, and the temperature values filling the cells.
To convert from long form to wide form we can use the `spread`
function.
```{r}
mean_max_temp_wide <- mean_max_temp %>%
spread(key = city_name,value = mean_max_temp)
```
Click on the data in the Environment to see it.
You can save your new human-readable data.
```{r}
write_csv(mean_max_temp_wide, path = "data/mean_max_temp_wide.csv")
```
If you get data in wide form and need to convert it to long form for use in R, use the `gather` function. This is a bit more complicated: you need to specify the columns you want to gather, the name of the column where those column names go, and the name of the column where all the values go.
```{r}
mean_max_temp_widetolong <- mean_max_temp_wide %>%
gather(-year, key = city_name,value = mean_max_temp)
mean_max_temp_widetolong <- mean_max_temp_wide %>%
gather(2:8, key = city_name,value = mean_max_temp)
mean_max_temp_widetolong <- mean_max_temp_wide %>%
gather(BRISBANE:SYDNEY, key = city_name,value = mean_max_temp)
```
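If you are running tidyr 1.0 or newer, note that `pivot_wider()` and `pivot_longer()` are the newer names for these reshaping operations; a rough equivalent of the steps above, shown but not evaluated:
```{r pivot-example, eval=FALSE}
# wide: one column per city
mean_max_temp_wide2 <- mean_max_temp %>%
  pivot_wider(names_from = city_name, values_from = mean_max_temp)

# back to long: one row per city-year combination
mean_max_temp_long2 <- mean_max_temp_wide2 %>%
  pivot_longer(-year, names_to = "city_name", values_to = "mean_max_temp")
```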
## More data manipulation
Another analysis we might be interested in is whether temperatures for individual months are increasing.
First let's extract the month from the date information. This time we'll do it a bit differently, using `mutate()` to add both the year and the month in one step.
```{r}
temperature <- temperature %>%
mutate(year = year(date), month = month(date))
```
Challenge: Examine how the temperature has increased over time in Canberra. Filter for data from Canberra. Get the average max temperature in each month for each year. Then use `spread` so that each year becomes a column and the table is easier to view.
```{r}
monthly_temps_CAN <- temperature %>%
filter(city_name == "CANBERRA") %>%
filter(temp_type == "max") %>%
group_by(month, year) %>%
summarize(avg_temp = mean(temperature, na.rm = T)) %>%
spread(year, avg_temp)
head(monthly_temps_CAN)
```
Now we can summarize by month and plot.
```{r}
monthly_temps <- temperature %>%
group_by(month, year) %>%
summarize(avg_temp = mean(temperature, na.rm = T))
ggplot(monthly_temps, aes(month, avg_temp, group=year, color=year)) +
geom_line() + geom_point()
```
Did you notice that the colors fall along a continuous gradient?
For R, year is a numeric variable, so the color scale is continuous.
This probably isn't what you want.
You can treat year as a factor so each year gets its own distinct color.
```{r}
ggplot(monthly_temps, aes(month, avg_temp, group=year, color=factor(year))) +
geom_line() + geom_point()
```
On the other hand, although you can see the lighter colors mostly at the top, the plot is messy: there are too many categories for a discrete color scale to be readable.
Let's try grouping the information by before and after 1964 to make things simpler.
We have already seen `mutate` to add a column.
This time we want to add a column that designates whether the year is 1964 or earlier (coded 0)
or after 1964 (coded 1).
We use the `ifelse` function to do this.
Then we take this updated dataframe and group by both month and time period
and summarize the average temperature for each group.
Note `na.rm = T`, which tells `mean()` to ignore missing values.
```{r}
monthly_temps_by_period <- temperature %>%
mutate(time_period = ifelse( year <= 1964, 0, 1)) %>%
group_by(month, time_period) %>%
summarize(avg_temp = mean(temperature, na.rm = T))
```
Now we can plot the result.
The x-axis is month (notice we've used lubridate's `month()` function with `label = T` so the months display as names rather than numbers).
The y-axis is the temperature.
And we have separate lines for our two time periods.
```{r}
ggplot(monthly_temps_by_period, aes(month(month, label = T), avg_temp,
group=time_period, col=factor(time_period))) +
geom_line() + geom_point() +
scale_color_discrete(name = "", labels = c("Before 1964", "After 1964")) +
labs(y = "Average Temperature (°C)",
x = "Month")
```
What do you observe about how monthly temperatures differ?
Save your plot!
```{r, eval=FALSE}
ggsave('avg_monthly_temps.png')
```
We can also look at cities separately.
Notice that I've redone the grouping for the summary table to include city.
Also notice the `facet_wrap()` function, which automatically creates a separate panel for each city.
```{r}
monthly_temps_by_period <- temperature %>%
mutate(time_period = ifelse( year <= 1964, 0, 1)) %>%
group_by(month, time_period, city_name) %>%
summarize(avg_temp = mean(temperature, na.rm = T))
ggplot(monthly_temps_by_period, aes(month(month, label = T), avg_temp,
group=time_period, col=factor(time_period))) +
geom_line() + geom_point() + facet_wrap(~city_name) +
scale_color_discrete(name = "", labels = c("Before 1964", "After 1964")) +
labs(y = "Average Temperature (°C)",
x = "Month")
```