---
title: "Introduction to R, Class 2: Working with data"
output: github_document
---
<!--class2.md is generated from class2.Rmd. Please edit that file -->
```{r setup, include=FALSE, purl=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Objectives
In the last session,
we learned some fundamental principles of R and how to work in RStudio.
In this session,
we'll continue our introduction to R by working with a large dataset
that more closely resembles what you may encounter while analyzing data for research.
By the end of this session, you should be able to:
- import spreadsheet-style data into R as a data frame
- extract portions of data from a data frame
- manipulate factors (categorical data)
## Importing spreadsheet-style data into R
Open RStudio, and we'll check to make sure you're ready to start work again.
You can check to see if you're working in your project directory
by looking at the top of the Console.
You should see the path (location in your computer)
for the project directory you created last time
(e.g., `~/Desktop/intro_r`).
If you do not see the path to your project directory,
go to `File -> Open Project` and navigate to the location of your project directory.
Alternatively, using your operating system's file browser,
double click on the `r_intro.Rproj` file.
Create a new R script (`File -> New File -> R Script`)
and save it in your project directory with the name `class2.R`.
Place the following comment on the top line as a title:
`# Introduction to R: Class 2`
In the last session, we recommended organizing your work in
directories (folders) according to projects.
While a thorough discussion of project organization is beyond the scope of this class,
we will continue to model best practices by creating a directory to hold our data:
```{r eval=FALSE}
# make a directory
dir.create("data")
```
```{r include=FALSE, purl=FALSE}
# code above not run, so this hidden code creates the directory if necessary
if (!dir.exists("data")) {dir.create("data")}
```
You should see the new directory appear in your project directory,
in the lower right panel in RStudio.
There is also a button in that panel you can use to create a new folder,
but including the code to perform this task makes other people
(and yourself) able to reproduce your work more easily.
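One detail to be aware of when rerunning a script: `dir.create` gives a warning if the directory already exists. A minimal sketch of a rerun-safe version is shown here. It assumes a temporary location (`tempdir()`) and a hypothetical folder name, `demo_data`, chosen only for illustration so that it does not touch your project directory.

```{r}
# build a path inside R's temporary directory (illustration only)
demo_dir <- file.path(tempdir(), "demo_data")
# create the directory only if it is not already present
if (!dir.exists(demo_dir)) {
  dir.create(demo_dir)
}
# confirm the directory now exists
dir.exists(demo_dir)
```

In your own script, you could apply the same pattern to `"data"` in place of `demo_dir`.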
Now that we have a place to store our data,
we can go ahead and download the dataset:
```{r}
# download data from url
download.file("https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.csv", "data/clinical.csv")
```
The function call above takes two arguments,
both enclosed in quotation marks:
first, you indicate where the data can be found online.
Second, you indicate where R should store a copy of the file on your own computer.
The output from that command may look alarming,
but it represents information confirming it worked.
You can click on the `data` folder to ensure the file is now present.
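If you rerun your script often, you may prefer not to download the file every time. The sketch below shows one possible approach, assuming a hypothetical helper named `download_if_missing` (it is not part of R). The demonstration points at a stand-in file in a temporary location, so nothing is actually downloaded when it runs.

```{r}
# hypothetical helper: download a file only when no local copy exists
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("Using existing copy: ", destfile)
  } else {
    download.file(url, destfile)
  }
  invisible(destfile)
}
# create a stand-in file so the example runs without network access
demo_file <- tempfile(fileext = ".csv")
writeLines("a,b\n1,2", demo_file)
# the stand-in file already exists, so the download step is skipped
download_if_missing("https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.csv", demo_file)
```

Keep in mind that skipping the download also means you will not pick up changes if the online file is updated.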
Notice that the URL above ends in `clinical.csv`,
which is also the name we used to save the file on our computers.
If you click on the URL and view it in a web browser,
the format isn't particularly easy for us to understand.
You can also view the file by clicking on it in the lower right hand panel,
then selecting "View File."
> The option to "Import Dataset" you see after clicking on the file
> references some additional tools present in RStudio
> that can assist with various kinds of data import.
> Because this requires installing additional software,
> complete exploration of these options is outside the scope of this class.
> For more information, check out [this article](https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio).
The data we've downloaded are in csv format,
which stands for "comma separated values."
This means the data are organized into rows and columns,
with columns separated by commas.
These data are arranged in a tidy format,
meaning each row represents an observation,
and each column represents a variable (piece of data for each observation).
Moreover, only one piece of data is entered in each cell.
Now that the data are downloaded,
we can import the data and assign them to an object:
```{r}
# import data and assign to object
clinical <- read.csv("data/clinical.csv")
```
You should see `clinical` appear in the Environment window on the upper right panel in RStudio.
If you click on `clinical` there,
a new tab will appear next to your R script in the Source window.
> Clicking on the name of an object in the Environment window
> is a shortcut for running `View(clinical)`;
> you'll see this code appear in the Console after clicking.
Now that we have the data imported and assigned to an object,
we can take some time to explore the data we'll be using for the rest of this course:
- These data are clinical cancer data from the [National Cancer Institute's Genomic Data Commons](https://gdc.cancer.gov),
specifically from The Cancer Genome Atlas, or [TCGA](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga).
- Each row represents a patient,
and each column represents information about demographics (race, age at diagnosis, etc)
and disease (e.g., cancer type).
- The data were downloaded and aggregated using an R script,
which you can view in the [GitHub repository for this course](https://github.com/fredhutchio/R_intro/blob/master/extra/clinical_data.R).
The function we used, `read.csv`, is one of a family of functions for importing spreadsheet-style data.
Check out the help documentation for `read.csv` for more options for importing data.
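As a small illustration of one of those options, the sketch below reads the same made-up table written with two different separators. It assumes the `text = ` argument, which lets `read.csv` parse a character string instead of a file, and uses invented values rather than the clinical data.

```{r}
# the same made-up table written with two different separators
comma_text <- "patient,age\np1,54\np2,61"
semicolon_text <- "patient;age\np1;54\np2;61"
# the default separator for read.csv is a comma
from_comma <- read.csv(text = comma_text)
# sep overrides the default separator
from_semicolon <- read.csv(text = semicolon_text, sep = ";")
# compare the two results
identical(from_comma, from_semicolon)
```

The same `sep = ` parameter applies when reading from a file path.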
> You can also import data directly from a URL with `read.csv`,
> by running `clinical <- read.csv("https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.csv")`.
> For these lessons, we model downloading and importing in two steps,
> so you retain a copy of the data on your computer.
> This reflects how you're likely to import your own data,
> as well as recommended practice for retaining data used in an analysis (since data online may be updated).
> #### Challenge-data
> Download, inspect, and import the following data files.
> The URL for each sample dataset is included along with a name to assign to the object.
> (Hint: you can use the same function as above,
> but may need to update the `sep = ` parameter)
> - URL: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.tsv, object name: `example1`
> - URL: https://raw.githubusercontent.com/fredhutchio/R_intro/master/extra/clinical.txt, object name: `example2`
Importing data can be tricky and frustrating.
However, if you can't get your data into R,
you can't do anything to analyze or visualize it.
It's worth understanding how to do it effectively to save you time and energy later.
## Data frames
Now that we have data imported and available,
we can start to inspect the data more closely.
These data have been interpreted by R to be a data frame,
which is a data structure (way of organizing data) that is analogous to tabular or spreadsheet style data.
By definition, a data frame is a table made of vectors (columns) that are all of the same length.
As we learned in our last session,
a vector needs to include all of the same type of data (e.g., character, numeric).
A data frame, however,
can include vectors (columns) of different data types.
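To make this concrete, here is a minimal sketch using a small made-up data frame (the names `toy_df`, `patient`, `age`, and `smoker` are invented for illustration). Each column is a vector of a single type, but the types differ between columns.

```{r}
# made-up example: three columns of equal length but different types
toy_df <- data.frame(
  patient = c("p1", "p2", "p3"),   # character
  age = c(54, 61, 47),             # numeric
  smoker = c(TRUE, FALSE, TRUE),   # logical
  stringsAsFactors = FALSE         # keep text as character in any R version
)
# report the type of each column
sapply(toy_df, class)
```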
To learn more about this data frame,
we'll first explore its dimensions:
```{r}
# assess size of data frame
dim(clinical)
```
The output reflects the number of rows first (6832),
then the number of columns (20).
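If you only need one of the two dimensions, `nrow` and `ncol` return them separately. The sketch below uses a small made-up data frame (`size_demo` is an invented name) rather than `clinical`, so the numbers are easy to check by eye.

```{r}
# made-up data frame with 4 rows and 2 columns
size_demo <- data.frame(id = 1:4, value = c(2.5, 3.1, 4.8, 1.2))
dim(size_demo)      # rows, then columns
nrow(size_demo)     # rows only
ncol(size_demo)     # columns only
dim(size_demo)[1]   # same as nrow(), using square brackets on the output of dim()
```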
We can also preview the content by showing the first few rows:
```{r}
# preview first few rows
head(clinical)
```
The default number of rows shown is six.
You can specify a different number using the `n = ` parameter,
demonstrated below using `tail`,
which shows the last few rows:
```{r}
# show last three rows
tail(clinical, n = 3)
```
We often need to reference the names of columns,
so it's useful to print only those to the screen:
```{r}
# view column names
names(clinical)
```
It's also possible to view row names using `rownames(clinical)`,
but our data only possess numbers for row names so it's not very informative.
As we learned last time,
we can use `str` to provide a general overview of the object:
```{r}
# show overview of object
str(clinical)
```
The output provided includes:
- data structure: data frame
- dimensions: 6832 rows and 20 columns
- column-by-column information: each prefaced with a `$`, and including the column name, data type (num, int, Factor), and the first few values
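> Factors are how versions of R before 4.0 interpret character data in data frames by default;
> in R 4.0 and later, these columns are listed as `chr` (character) unless you import with `stringsAsFactors = TRUE`.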
> We'll talk more about working with factors at the end of this lesson.
Finally, we can also examine basic summary statistics for each column:
```{r}
# provide summary statistics for each column
summary(clinical)
```
For numeric data (such as `year_of_death`),
this output includes common statistics like median and mean,
as well as the number of rows (patients) with missing data (as `NA`).
For factors (character data, such as `disease`),
you're given a count of the number of times the top six most frequent levels (categories)
occur in the data frame.
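The sketch below shows the same two behaviors on small made-up vectors (`ages` and `disease_demo` are invented names and values). It also illustrates why the count of `NA` values matters: many functions return `NA` when missing data are present unless you ask for them to be removed with `na.rm = TRUE`.

```{r}
# made-up numeric vector with one missing value
ages <- c(54, 61, NA, 47)
summary(ages)               # includes a count of NA values
mean(ages)                  # missing values propagate by default
mean(ages, na.rm = TRUE)    # remove missing values before calculating
# made-up factor: summary counts each level
disease_demo <- factor(c("LUSC", "BRCA", "LUSC"))
summary(disease_demo)
```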
## Subsetting data frames
Now that our data are available for use, we can begin extracting relevant information from them.
```{r}
# extract first column and assign to a variable
first_column <- clinical[1]
```
As discussed last time with vectors,
the square brackets (`[ ]`) are used to subset,
or reference part of,
a data frame.
You can inspect the output object by clicking on it in the environment.
It contains all of the rows for only the first column.
When a single number is included in the square brackets,
R assumes you are referencing a column.
When you include two numbers in square brackets separated by a comma,
R assumes the **first number references the row** and the **second number references the column** you desire.
This means you can also reference the first column as follows:
```{r}
# extract first column
first_column_again <- clinical[ , 1]
```
Leaving one field blank means you want the entire set in the output
(in this case, all rows).
> #### Challenge-extract
> What is the difference in results between the last two lines of code?
Similarly, we can also extract only the first row across all columns:
```{r}
# extract first row
first_row <- clinical[1, ]
```
We can also extract slices, or sections of rows and columns,
such as a single cell:
```{r}
# extract cell from first row of first column
single_cell <- clinical[1,1]
```
To extract a range of cells,
we use the same colon (`:`) syntax from last time:
```{r}
# extract a range of cells, rows 1 to 3, second column
range_cells <- clinical[1:3, 2]
```
This works for ranges of columns as well.
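As a quick sketch of column ranges, the example below uses a small made-up data frame (`grid_demo` is an invented name) so that the selected block is easy to verify.

```{r}
# made-up data frame with three rows and four columns
grid_demo <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
# rows 1 to 2, columns 2 to 3
grid_demo[1:2, 2:3]
# all rows, columns 1 and 4 (not adjacent, so combined with c)
grid_demo[ , c(1, 4)]
```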
We can also exclude particular parts of the dataset using a minus sign:
```{r}
# exclude first column
exclude_col <- clinical[ , -1]
```
Combining what we know about R syntax,
we can also exclude a range of rows using the `c` function:
```{r}
# exclude first 100 rows
exclude_range <- clinical[-c(1:100), ]
```
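The grouping around the range matters here. The sketch below, using a made-up data frame (`rows_demo` is an invented name), shows that parentheses alone work the same way as `c`, and that leaving out the grouping changes the meaning, because the minus sign then applies only to the first number.

```{r}
# made-up data frame with five rows
rows_demo <- data.frame(id = 1:5, score = c(10, 20, 30, 40, 50))
# both forms drop rows 1 to 3
with_c <- rows_demo[-c(1:3), ]
with_parens <- rows_demo[-(1:3), ]
identical(with_c, with_parens)
# without grouping, this is the sequence from -1 up to 3
-1:3
```

Using a sequence that mixes negative and positive numbers inside square brackets results in an error, so keep the grouping when excluding ranges.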
So far, we've been referencing parts of the dataset based on index position,
or the number of row/column.
Because we have included column names in our dataset,
we can also reference columns using those names:
```{r}
# extract column by name
name_col1 <- clinical["tumor_stage"]
name_col2 <- clinical[ , "tumor_stage"]
```
Note the example above features quotation marks around the column name.
Without the quotation marks,
R will assume we're attempting to reference an object.
As we discussed with subsetting based on index above,
the two objects created above differ in the data structure.
`name_col1` is a data frame (with one column),
while `name_col2` is a vector.
Although this difference in the type of object may not matter for your analysis,
it's useful to understand that there are multiple ways to accomplish a task,
each of which may make particular code work more easily.
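You can check the difference yourself with `class`. The sketch below uses a made-up data frame (`shape_demo` is an invented name) and also shows the `drop = FALSE` argument, which asks R to keep the data frame structure even when the comma syntax selects a single column.

```{r}
# made-up data frame
shape_demo <- data.frame(id = 1:3, stage = c("i", "ii", "iii"))
class(shape_demo["stage"])                    # data frame with one column
class(shape_demo[ , "stage"])                 # vector
class(shape_demo[ , "stage", drop = FALSE])   # drop = FALSE keeps the data frame
```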
There are additional ways to extract columns,
which use R syntax specific to complex data objects,
and may be useful to recognize as your R skills progress.
The first is to use double square brackets:
```{r}
# double square brackets syntax
name_col3 <- clinical[["tumor_stage"]]
```
You can think of this approach as digging deeply into a complex object
to retrieve data.
The final approach is equivalent to the last example,
but can be considered a shortcut since it requires fewer keystrokes
(no quotation marks, and only one symbol):
```{r}
# dollar sign syntax
name_col4 <- clinical$tumor_stage
```
Both of the last two approaches above return vectors.
For more information about these different ways of accessing parts of a data frame,
see [this article](https://www.r-bloggers.com/r-accessors-explained/).
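One practical difference between these two approaches is sketched below with a made-up data frame (`access_demo` and `wanted` are invented names). Double square brackets accept a column name stored in another object, while the dollar sign treats whatever follows it as the literal column name.

```{r}
# made-up data frame
access_demo <- data.frame(id = 1:3, stage = c("i", "ii", "iii"))
# both approaches return the same vector
identical(access_demo[["stage"]], access_demo$stage)
# double brackets accept a column name stored in another object
wanted <- "stage"
access_demo[[wanted]]
# the dollar sign looks for a column literally named "wanted"
access_demo$wanted
```

This is one reason you may see double square brackets used inside functions and loops, where the column name is not known in advance.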
The following challenges all use the `clinical` object:
> #### Challenge-days
> Code as many different ways possible to extract the column `days_to_death`.
> #### Challenge-rows
> Extract the first 6 rows for only `age_at_diagnosis` and `days_to_death`.
> #### Challenge-calculate
> Calculate the range and mean for `cigarettes_per_day`.
## Factors
**Note:** This section was written with a version of R earlier than 4.0,
which automatically interpreted all character data as factors when importing
(this is not true of R 4.0 and later).
To execute the code in this section,
please first import your data again,
using the following modified command:
```{r}
clinical <- read.csv("data/clinical.csv", stringsAsFactors = TRUE)
```
This section explores one of the trickier types of data you're likely to encounter:
factors, which are how R interprets categorical data.
When we imported our dataset into R,
the `read.csv` function assumed that all the character data
in our dataset are factors, or categories.
Factors have predefined sets of values, called levels.
We can explore what this means by first creating a factor vector:
```{r}
# create vector with factor data
test_data <- factor(c("placebo", "test_drug", "placebo", "known_drug"))
# show factor
test_data
```
This vector includes four pieces of data
(often referred to as items or elements),
which are printed as output above.
The second line of the output shows information about the levels,
or categories, of our vector.
We can also access this information separately,
which is useful if the data (vector) has a large number of elements:
```{r}
# show levels of factor
levels(test_data)
```
The levels in this test dataset are currently listed in alphabetical order,
which is the default presentation in R.
The order of levels dictates how they are presented in subsequent analyses,
so there are definitely cases in which you may want the levels in a specific order.
In the case of `test_data`,
we may want to keep the two drug treatments together,
with placebo at the end:
```{r}
# reorder factors to put placebo at end
test_data <- factor(test_data, levels = c("known_drug", "test_drug", "placebo"))
# show reordered
test_data
```
This doesn't change the data itself,
but does make it easier to manage the data later.
Another useful aspect of factors is that they are stored as integers with labels.
This means that you can easily convert them to numeric data:
```{r}
# converting factors to numeric
as.numeric(test_data)
```
This can be handy for some types of statistical analyses,
and also illustrates the importance of ordering your levels appropriately.
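A related caution applies when the labels themselves look like numbers. The sketch below uses a made-up factor (`dose` is an invented name) to show that `as.numeric` returns the integer codes for the levels, not the values written in the labels. Converting to character first is a common way to recover the labels as numbers.

```{r}
# made-up factor whose labels look like numbers
dose <- factor(c("10", "5", "20"))
levels(dose)                       # levels are sorted as text, not as numbers
as.numeric(dose)                   # integer codes for the levels
as.numeric(as.character(dose))     # the labels converted to numbers
```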
We can apply this knowledge to our clinical dataset,
by first observing how the data are presented when creating a basic plot:
```{r race-plot}
# quick and dirty plot
plot(clinical$race)
```
The labels as presented by default are not particularly readable,
and also lack appropriate capitalization and formatting.
While it is possible to modify only the plot labels,
we would have to do that for all of our subsequent analyses.
It is more efficient to modify the levels once:
```{r}
# assign race data to new object
race <- clinical$race
levels(race)
```
By assigning the data to a new object,
we can more easily perform manipulations without altering the original dataset.
The output above shows the current levels for race.
We can access each level using its position in this order,
combined with our knowledge of square brackets for subsetting:
```{r}
levels(race)[1]
```
We can modify them to improve their formatting by assigning a new level
(name) of our choosing:
```{r}
# correct factor levels
levels(race)[1] <- "Am Indian"
levels(race)[2] <- "Asian" # capitalize asian
levels(race)[3] <- "black"
levels(race)[4] <- "Pac Isl"
levels(race)[5] <- "unknown"
# show revised levels
levels(race)
```
Although we're not doing so here,
we could also reorder the levels
(as we did for `test_data`).
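Renaming by position depends on knowing the current order of the levels. As an alternative sketch, you can select the level to rename by matching its current name. The example below uses a made-up factor (`group` is an invented name) rather than the clinical data.

```{r}
# made-up factor
group <- factor(c("asian", "white", "asian", "not reported"))
# rename a level by matching its current name rather than its position
levels(group)[levels(group) == "asian"] <- "Asian"
# show revised levels
levels(group)
```

This approach keeps working if the levels are later reordered, at the cost of slightly longer code.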
Once we are satisfied with the resulting levels,
we assign the modified factor back to the original dataset:
```{r better-race-plot}
# replace race in data frame
clinical$race <- race
# replot with corrected names
plot(clinical$race)
```
This section was a very brief introduction to factors,
and it's likely you'll need more information when working with categorical data of your own.
A good place to start would be [this article](https://peerj.com/preprints/3163/),
and exploring some of the tools in the tidyverse (which we'll discuss in the next lesson).
> #### Challenge-not-reported
> In your clinical dataset,
> replace "not reported" in `ethnicity` with `NA`.
> #### Challenge-remove
> What Google search helps you identify additional strategies for renaming missing data?
## Optional: Creating a data frame by hand
This last section shows two different approaches to creating a data frame by hand
(in other words, without importing the data from a spreadsheet).
It isn't particularly useful for most of your day-to-day work,
and also not a method you want to use often,
as this type of data entry can introduce errors.
However, it's frequently used in online tutorials,
which can be confusing,
and also helps illustrate how data frames are composed.
The first approach is to create separate vectors (columns),
and then join them together in a second step:
```{r}
# create individual vectors
cancer <- c("lung", "prostate", "breast")
metastasis <- c("yes", "no", "yes")
cases <- c(30, 50, 100)
# combine vectors
example_df1 <- data.frame(cancer, metastasis, cases)
str(example_df1)
```
The resulting data frame has column headers,
identified from the names of the vectors combined together.
The next way seems more complex,
but represents the code above combined into one step:
```{r}
# create vectors and combine into data frame simultaneously
example_df2 <- data.frame(cancer = c("lung", "prostate", "breast"),
                          metastasis = c("yes", "no", "yes"),
                          cases = c(30, 50, 100), stringsAsFactors = FALSE)
str(example_df2)
```
As we learned above,
factors can be particularly difficult,
so it's useful to know that you can use `stringsAsFactors = FALSE`
to import such data as character instead.
## Wrapping up
In this session,
we learned to import data into R from a csv file,
learned multiple ways to access parts of data frames,
and manipulated factors.
In the next session,
we'll begin to explore a set of powerful, elegant data manipulation tools
for data cleaning, transforming, and summarizing,
and we'll prepare some data to visualize in our final session.
**This document is written in [R markdown](http://rmarkdown.rstudio.com),**
which is a method of formatting text, code, and output to create documents that are sharable with other people.
While this document is intended to serve as a reference for you to read while typing code into your own script,
you may also be interested in modifying and running code in the original R markdown file ([`class2.Rmd`](class2.Rmd) in the GitHub repository).
The course materials webpage is available [here](README.md).
Materials for all lessons in this course include:
- [Class 1](class1.md): R syntax, assigning objects, using functions
- [Class 2](class2.md): Data types and structures; slicing and subsetting data
- [Class 3](class3.md): Data manipulation with `dplyr`
- [Class 4](class4.md): Data visualization in `ggplot2`
## Extra exercises
Answers to all challenge exercises are available [here](solutions/).
The following exercises all use the same `clinical` data from this class.
#### Challenge-disease-race
Extract the last 100 rows for only `disease` and `race` and save to an object called `disease_race`.
#### Challenge-min-max
Calculate the minimum and maximum for `days_to_death`.
#### Challenge-factors
Change all of the levels of `race` to shorter names for each category,
and appropriately indicate missing data.