class2.Rmd

---
title: "Introduction to R, Class 2: Working with data"
output: github_document
---

<!--class2.md is generated from class2.Rmd. Please edit that file -->

```{r setup, include=FALSE, purl=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Objectives

In the last section, 
we learned some fundamental principles of R and how to work in RStudio.

In this session,
we'll continue our introduction to R by working with a large dataset 
that more closely resembles that which you may encounter while analyzing data for research. 
By the end of this session, you should be able to:

- import spreadsheet-style data into R as a data frame
- extract portions of data from a data frame
- manipulate factors (categorical data)

## Importing spreadsheet-style data into R

Open RStudio, and we'll check to make sure you're ready to start work again.
You can check to see if you're working in your project directory 
by looking at the top of the Console. 
You should see the path (location in your computer)
for the project directory you created last time 
(e.g., `~/Desktop/intro_r`).

If you do not see the path to your project directory,
go to `File -> Open Project` and navigate to the location of your project directory.
Alternatively, using your operating system's file browser,
double click on the `r_intro.Rrpoj` file.

Create a new R script (`File -> New File -> R Script`)
and save it in your project directory with the name `class2.R`.
Place the following comment on the top line as a title:

`# Introduction to R: Class 2`

In the last session, we recommended organizing your work in 
directories (folders) according to projects.
While a thorough discussion of project organization is beyond the scope of this class,
we will continue to model best practices by creating a directory to hold our data:

```{r eval=FALSE}
# make a directory
dir.create("data")
```

```{r include=FALSE, purl=FALSE}
# code above not run, so this hidden code creates the directory if necessary
if (!dir.exists("data")) {dir.create("data")}
```

You should see the new directory appear in your project directory, 
in the lower right panel in RStudio. 
There is also a button in that panel you can use to create a new folder,
but including the code to perform this task makes other people 
(and yourself) able to reproduce your work more easily.

Now that we have a place to store our data,
we can go ahead and download the dataset:

```{r}
# download data from url
download.file("https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.csv", "data/clinical.csv")
```

The code above has two arguments, 
both encompassed in quotation marks:
first, you indicate where the data can be found online. 
Second, you indicate where R should store a copy of the file on your own computer.

The output from that command may look alarming,
but it represents information confirming it worked.
You can click on the `data` folder to ensure the file is now present.

Notice that the URL above ends in `clinical.csv`,
which is also the name we used to save the file on our computers.
If you click on the URL and view it in a web browser, 
the format isn't particularly easy for us to understand.
You can also view the file by clicking on it in the lower right hand panel,
then selecting "View File."

> The option to "Import Dataset" you see after clicking on the file 
> references some additional tools present in RStudio 
> that can assist with various kinds of data import.
> Because this requires installing additional software, 
> complete exploration of these options is outside the scop of this class.
> For more information, check out [this article](https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio).

The data we've downloaded are in csv format, 
which stands for "comma separated values."
This means the data are organized into rows and columns,
with columns separated by commas.

These data are arranged in a tidy format,
meaning each row represents an observation, 
and each column represents a variable (piece of data for each observation).
Moreover, only one piece of data is entered in each cell. 

Now that the data are downloaded, 
we can import the data and assign to an object:

```{r}
# import data and assign to object
clinical <- read.csv("data/clinical.csv")
```

You should see `clinical` appear in the Environment window on the upper right panel in RStudio.
If you click on `clinical` there,
a new tab will appear next to your R script in the Source window.

> Clicking on the name of an object in the Environment window 
> is a shortcut for running `View(clinical)`;
> you'll see this code appear in the Console after clicking.

Now that we have the data imported and assigned to an object,
we can take some time to explore the data we'll be using for the rest of this course:

- These data are clinical cancer data from the [National Cancer Institute's Genomic Data Commons](https://gdc.cancer.gov),
specifically from The Cancer Genome Atlas, or [TCGA](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga).
- Each row represents a patient, 
and each column represents information about demographics (race, age at diagnosis, etc) 
and disease (e.g., cancer type).
- The data were downloaded and aggregated using an R script,
which you can view in the [GitHub repository for this course](https://github.com/fredhutchio/R_intro/blob/master/extra/clinical_data.R).

The function we used to import the data is one of a family of commands used to import the data. 
Check out the help documentation for `read.csv` for more options for importing data.

> You can also import data directly into R using `read.csv`,
> using `clinical <- read.csv("https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.csv")`.
> For these lessons, we model downloading and importing in two steps, 
> so you retain a copy of the data on your computer. 
> This reflects how you're likely to import your own data,
> as well as recommended practice for retaining data used in an analysis (since data online may be updated).

> #### Challenge-data
Download, inspect, and import the following data files. 
The URL for each sample dataset is included along with a name to assign to the object.
(Hint: you can use the same function as above,
but may need to update the `sep = ` parameter)
- URL: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.tsv, object name: `example1`
- URL: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.txt, object name: `example2`

Importing data can be tricky and frustrating, 
However, if you can't get your data into R, 
you can't do anything to analyze or visualize it.
It's worth understanding how to do it effectively to save you time and energy later.

## Data frames 

Now that we have data imported and available,
we can start to inspect the data more closely. 

These data have been interpreted by R to be a data frame,
which is a data structure (way of organizing data) that is analogous to tabular or spreadsheet style data.
By definition, a data frame is a table made of vectors (columns) of all the same length.
As we learned in our last session,
a vector needs to include all of the same type of data (e.g., character, numeric).
A data frame, however, 
can include vectors (columns) of different data types.

To learn more about this data frame,
we'll first explore its dimensions:

```{r}
# assess size of data frame
dim(clinical)
```
The output reflects the number of rows first (6832), 
then the number of columns (20).

We can also preview the content by showing the first few rows:

```{r}
# preview first few rows
head(clinical) 
```

The default number of rows shown is six.
You can specify a different number using the `n = ` parameter,
demonstrated below using `tail`, 
which shows the last few rows

```{r}
# show last three rows
tail(clinical, n = 3) 
```

We often need to reference the names of columns,
so it's useful to print only those to the screen:

```{r}
# view column names
names(clinical) 
```

It's also possible to view row names using`rownames(clinical)`,
but our data only possess numbers for row names so it's not very informative.

As we learned last time, 
we can use `str` to provide a general overview of the object:

```{r}
# show overview of object
str(clinical) 
```
The output provided includes:

- data structure: data frame
- dimensions: 6832 rows and 20 columns
- column-by-column information: each prefaced with a `$`, and includes the column name, data type (num, int, Factor)

> Factors are how character data are interpreted by R in data frames.
> We'll talk more about working with factors at the end of this lesson.

Finally, we can also examine basic summary statistics for each column:

```{r}
# provide summary statistics for each column
summary(clinical) 
```

For numeric data (such as `year_of_death`),
this output includes common statistics like median and mean,
as well as the number of rows (patients) with missing data (as `NA`).
For factors (character data, such as `disease`), 
you're given a count of the number of times the top six most frequent factors (categories) 
occur in the data frame.

## Subsetting data frames

Now that our data are available for use, we can begin extracting relevant information from them.

```{r}
# extract first column and assign to a variable
first_column <- clinical[1]
```

As discussed last time with vectors, 
the square brackets (`[ ]`) are used to subset, 
or reference part of, 
a data frame.
You can inspect the output object by clicking on it in the environment.
It contains all of the rows for only the first column.

When a single number is included in the square brackets, 
R assumes you are referencing a column. 
When you include two numbers in square brackets separated by a comma,
R assumes the **first number references the row** and the **second number references the column** you desire.

This means you can also reference the first column as follows:

```{r}
# extract first column
first_column_again <- clinical[ , 1]
```

Leaving one field blank means you want the entire set in the output
(in this case, all rows).

> #### Challenge-extract
> What is the difference in results between the last two lines of code?

Similarly, we can also extract only the first row across all columns:

```{r}
# extract first row 
first_row <- clinical[1, ]
```

We can also extract slices, or sections of rows and columns,
such as a single cell:

```{r}
# extract cell from first row of first column
single_cell <- clinical[1,1]
```

To extract a range of cells, 
we use the same colon (`:`) syntax from last time:

```{r}
# extract a range of cells, rows 1 to 3, second column
range_cells <- clinical[1:3, 2]
```

This works for ranges of columns as well.

We can also exclude particular parts of the dataset using a minus sign:

```{r}
# exclude first column
exclude_col <- clinical[ , -1] 
```

Combining what we know about R syntax,
we can also exclude a range of cells using the `c` function:

```{r}
# exclude first 100 rows
exclude_range <- clinical[-c(1:100), ] 
```

So far, we've been referencing parts of the dataset based on index position,
or the number of row/column.
Because we have included column names in our dataset,
we can also reference columns using those names:

```{r}
# extract column by name
name_col1 <- clinical["tumor_stage"]
name_col2 <- clinical[ , "tumor_stage"]
```

Note the example above features quotation marks around the column name.
Without the quotation marks, 
R will assume we're attempting to reference an object.

As we discussed with subsetting based on index above,
the two objects created above differ in the data structure.
`name_col1` is a data frame (with one column),
while `name_col2` is a vector.
Although this difference in the type of object may not matter for your analysis,
it's useful to understand that there are multiple ways to accomplish a task,
each of which may make particular code work more easily.

There are additional ways to extract columns,
which use R specific for complex data objects,
and may be useful to recognize as your R skills progress.

The first is to use double square brackets:

```{r}
# double square brackets syntax
name_col3 <- clinical[["tumor_stage"]]
```

You can think of this approach as digging deeply into a complex object 
to retrieve data.

The final approach is equivalent to the last example,
but can be considered a shortcut since it requires fewer keystrokes
(no quotation marks, and only one symbol):

```{r}
# dollar sign syntax
name_col4 <- clinical$tumor_stage
```

Both of the last two approaches above return vectors.
For more information about these different ways of accessing parts of a data frame,
see [this article](https://www.r-bloggers.com/r-accessors-explained/).

The following challenges all use the `clinical` object:

> #### Challenge-days
> Code as many different ways possible to extract the column `days_to_death`.

> #### Challenge-rows
> Extract the first 6 rows for only `age_at_diagnosis` and `days_to_death`.

> #### Challenge-calculate
> Calculate the range and mean for `cigarettes_per_day`.

## Factors

**Note:** This section was written with a previous version of R that automatically interprets
all character data as factors
(this is not true of more recent versions of R).
To execute the code in this section,
please first import your data again,
using the following modified command:

```{r}
clinical <- read.csv("data/clinical.csv", stringsAsFactors = TRUE)
```


This section explores one of the trickier types of data you're likely to encounter:
factors, which are how R interprets categorical data.

When we imported our dataset into R,
the `read.csv` function assumed that all the character data
in our dataset are factors, or categories. 
Factors have predefined sets of values, called levels.
We can explore what this means by first creating a factor vector:

```{r}
# create vector with factor data
test_data <- factor(c("placebo", "test_drug", "placebo", "known_drug"))
# show factor
test_data
```

This vector includes four pieces of data 
(often referred to as items or elements),
which are printed as output above.
The second line of the output shows information about the levels,
or categories, of our vector.
We can also access this information separately,
which is useful if the data (vector) has a large number of elements:

```{r}
# show levels of factor
levels(test_data) 
```

The levels in this test dataset are currently listed in alphabetical order,
which is the default presentation in R.
The order of factors dictates how they are presented in subsequent analyses,
so there are definitely cases in which you may want the levels in a specific order. 
In the case of `test_data`, 
we may want to keep the two drug treatments together, 
with placebo at the end:

```{r}
# reorder factors to put placebo at end
test_data <- factor(test_data, levels = c("known_drug", "test_drug", "placebo"))
# show reordered
test_data
```

This doesn't change the data itself,
but does make it easier to manage the data later.

Another useful aspect of factors is that they are stored as integers with labels.
This means that you can easily convert them to numeric data:

```{r}
# converting factors to numeric
as.numeric(test_data)
```

This can be handy for some types of statistical analyses, 
and also illustrates the importance of ordering your levels appropriately.

We can apply this knowledge to our clinical dataset,
by first observing how the data are presented when creating a basic plot:

```{r race-plot}
# quick and dirty plot
plot(clinical$race)
```

The labels as presented by default are not particularly readable, 
and also lack appropriate capitalization and formatting.
While it is possible to modify only the plot labels,
we would have to do that for all of our subsequent analyses.
It is more efficient to modify the levels once:

```{r}
# assign race data to new object 
race <- clinical$race 
levels(race)
```

By assigning the data to a new object, 
we can more easily perform manipulations without altering the original dataset.

The output above shows the current levels for race.
We can access each level using their position in this order,
combined with our knowledge of square brackets for subsetting:

```{r}
levels(race)[1]
```

We can modify them to improve their formatting by assigning a new level
(name) of our choosing:

```{r}
# correct factor levels
levels(race)[1] <- "Am Indian"
levels(race)[2] <- "Asian" # capitalize asian
levels(race)[3] <- "black"
levels(race)[4] <- "Pac Isl"
levels(race)[5] <- "unknown"
# show revised levels
levels(race) 
```

Although we're not doing so here, 
we could also reorder the levels 
(as we did for `test_data`).

Once we are satisfied with the resulting levels,
we assign the modified factor back to the original dataset:

```{r better-race-plot}
# replace race in data frame
clinical$race <- race
# replot with corrected names
plot(clinical$race)
```

This section was a very brief introduction to factors,
and it's likely you'll need more information when working with categorical data of your own.
A good place to start would be [this article](https://peerj.com/preprints/3163/),
and exploring some of the tools in the tidyverse (which we'll discuss in the next lesson).

> #### Challenge-not-reported
> In your clinical dataset, 
> replace "not reported" in ethnicity with NA

> #### Challenge-remove
> What Google search helps you identify additional strategies for renaming missing data?

## Optional: Creating a data frame by hand

This last section shows two different approaches to creating a data frame by hand
(in other words, without importing the data from a spreadsheet).
It isn't particularly useful for most of your day-to-day work,
and also not a method you want to use often,
as this type of data entry can introduce errors.
However, it's frequently used in online tutorials, 
which can be confusing,
and also helps illustrate how data frames are composed.

The first approach is to create separate vectors (columns),
and then join them together in a second step:

```{r}
# create individual vectors
cancer <- c("lung", "prostate", "breast")
metastasis <- c("yes", "no", "yes")
cases <- c(30, 50, 100)
# combine vectors
example_df1 <- data.frame(cancer, metastasis, cases)
str(example_df1)
```

The resulting data frame has column headers,
identified from the names of the vectors combined together.

The next way seems more complex,
but represents the code above combined into one step:

```{r}
# create vectors and combine into data frame simultaneously
example_df2 <- data.frame(cancer = c("lung", "prostate", "breast"),
                          metastasis = c("yes", "no", "yes"),
                          cases = c(30, 50, 100), stringsAsFactors = FALSE)
str(example_df2)
```

As we learned above, 
factors can be particularly difficult, 
so it's useful to know that you can use `stringsAsFactors = FALSE` 
to import such data as character instead.

## Wrapping up

In this session, 
we learned to import data into R from a csv file,
learned multiple ways to access parts of data frames, 
and manipulated factors.

In the next session, 
we'll begin to explore a set of powerful, elegant data manipulation tools 
for data cleaning, transforming, and summarizing,
and we'll prepare some data to visualize in our final session.

**This document is written in [R markdown](http://rmarkdown.rstudio.com),**
which is a method of formatting text, code, and output to create documents that are sharable with other people. 
While this document is intended to serve as a reference for you to read while typing code into your own script,
you may also be interested in modifying and running code in the original R markdown file ([`class2.Rmd`](class2.Rmd) in the GitHub repository).

The course materials webpage is available [here](README.md).
Materials for all lessons in this course include:

- [Class 1](class1.md): R syntax, assigning objects, using functions
- [Class 2](class2.md): Data types and structures; slicing and subsetting data
- [Class 3](class3.md): Data manipulation with `dplyr`
- [Class 4](class4.md): Data visualization in `ggplot2`

## Extra exercises

Answers to all challenge exercises are available [here](solutions/). 

The following exercises all use the same `clinical` data from this class.

#### Challenge-disease-race
Extract the last 100 rows for only `disease` and `rac`e and save to an object called `disease_race`.

#### Challenge-min-max
Calculate the minimum and maximum for `days_to_death`.

#### Challenge-factors
Change all of the factors of race to shorter names for each category,
and appropriately indicate missing data.