```{r include = FALSE}
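if (!knitr::is_html_output()) {
  # Narrow console output and code so chunks fit the page in non-HTML formats
  options(width = 56)
  knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 56, indent = 2), tidy = TRUE)
  # Keep figures where they are placed in the text
  knitr::opts_chunk$set(fig.pos = "H")
}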
```
# Introduction {#intro}
This chapter gives a brief introduction to what R is and whether you should use it. Many of you may already be aware of most of this information, but for those who are brand new to R, or even to data analysis and programming, this is a good place to start.
## What is R?
R is a free, open-source programming language, distributed under the GNU General Public License. More specifically, it's a statistical programming language, meaning that it's more often used for statistical analysis than for general software development. R is also a functional programming language, rather than an object-oriented programming language like Python. This means that operations in R are primarily performed by functions (input, do something, output), but more about that later.
Strictly speaking, R is not just a functional programming language. In reality, a language is never purely one type and R is no exception. There are object-oriented systems in R (three main ones), meaning that object-oriented programming is possible and relatively straightforward in R.
So basically, R is a functional programming language with some object-oriented systems. If that means very little to you, don't worry. For the vast majority of users, this is a purely academic definition.
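If you'd like to see what that looks like in practice, here's a minimal sketch. The names `double_it` and `describe` are made up for illustration, and the second half uses S3, one of R's object-oriented systems, purely as a taster.

```{r}
# Functional style: a function takes an input, does something, returns an output
double_it <- function(x) {
  x * 2
}
double_it(21)

# Functions are values too, so they can be handed to other functions
sapply(c(1, 2, 3), double_it)

# Object-oriented style (S3): the same call behaves differently
# depending on the class of the object it is given
describe <- function(x) UseMethod("describe")
describe.numeric <- function(x) "a number"
describe.character <- function(x) "some text"

describe(1)
describe("a")
```

Don't worry if the second half looks unfamiliar; most day-to-day R code looks like the first half.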
One important attribute of R that may affect you, however, is that R is an interpreted language. This essentially means that when you send someone some R code, they need R installed to be able to run it, which makes distributing standalone programs difficult. In the [opeRate](https://operate.arawles.co.uk) book that follows this one, we'll look at the `shiny` package, which can be used to quickly make web apps based on R code. These apps are no different in that they also need R to be able to run, but because they are web-based, R only needs to be installed on the machine hosting the app, which makes them significantly easier to share.
For the most part then, if you want to share R code with colleagues, they'll need to have R installed as well.
## Should I use R?
People will forever argue about which is better, R or Python or Java or C or writing down mathematical equations on a piece of paper and handing it to a monkey to solve. I imagine you're reading this because you heard that R was good for data analysis, and it absolutely is. And so is Python. They're just... different. Personally, I prefer to use R but I understand that other people don't.
Importantly though, never feel as though you've missed a trick by picking a particular language. Programming is not just a practice, it's a way of thinking, and experience is almost always transferable across languages.
For reference, however, here are a few of the things that you can use R for:
* data analysis
* statistics and machine learning
* reporting and writing technical documents
* web apps
* text analysis
If you're interested in any of these, then you're in the right place.
## Using R
R is very simple. There is a console where you type commands and get responses. Like the classic command-line interfaces you see when the stereotypical nerd has to hack into the FBI database, you type commands one at a time into the console, R processes each one, and then produces a response if appropriate. For example, if you type `2 + 2` into the R console and hit enter, you'll get `4`.
Writing commands out one at a time can be quite time-consuming if you want to make changes, however. So we use scripts to store multiple lines of code that can then be run all together. When you execute a script, each line gets passed in turn to the console and executed. For example, I might make a script with this code:
```{r}
variable1 <- 2 + 2
variable1 / 10
```
So when I run the script, it will run the first line, then the second without me having to type anything else in.
## RStudio
RStudio is separate from R. R is a programming language and RStudio is an integrated development environment or *IDE*. This means that RStudio doesn't actually run any code, it just passes it to R for you, meaning that you'll need R to really use RStudio.
RStudio is a massive part of how you interact with R, however. For example, with the exception of a few days when I was waiting for RStudio to be installed, I can't remember ever using R without RStudio.
In the [previous section](#using-r), we talked very briefly about the R console and scripts. RStudio helps with this workflow. It makes it easier to create scripts, providing extra tools to help you write code more quickly, and then acts as a window to R when you want to execute the script.
RStudio is provided by a company called Posit. Previously, Posit was also called 'RStudio', which led to some confusion between the product and the company, hence the name change to Posit. But you may occasionally see the company still referred to as 'RStudio', so just bear in mind that it's different to the IDE we're describing here.
### What is an IDE?
At its simplest, an IDE helps you get work done in your programming language of choice. It can help you save blocks of code, organise projects, save plots and everything in between. R comes with a basic user interface when you install it, but RStudio provides a lot more functionality to help you interact with the R console.
### Using RStudio
RStudio is an IDE that is under active development and so I won't go through a tutorial here for fear that it will soon become outdated. The [RStudio website](http://rstudio.com) has lots of tutorials and support to help you get to grips with RStudio, but we'll look at a couple of concepts now to get you started.
#### Console
One of the panes in RStudio will be the R console. This is where your commands are actually executed. When you see a `>` at the console, that means it's waiting for commands. If you don't see one, it usually means that R is processing something so give it a second. If you see a `+` at the console it means that your previous line wasn't a complete command and so R is waiting for you to finish it off. The best thing when you see that is to press "Esc" and check your command to make sure it's syntactically valid.
#### Scripts
Typing directly into the console is okay for quick interactive use, but soon you'll be doing more complicated operations that will span multiple lines. To make things easier, RStudio has a **scripts** pane. This is where you can write `.R` files that store R commands. You then highlight the lines you want to execute and click "Run" or press Ctrl+Enter (Cmd+Enter on macOS), and the lines will be passed one at a time to the console where they are executed.
#### Environment
The **environment** pane shows you all the objects that are in your current environment. Objects and environments aren't concepts we've looked at yet, so for now just think of this pane as a window to any variables you create.
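As a rough sketch of what that means, the made-up variables below would show up in the environment pane as soon as they're created, and `ls()` gives you the same list of names from the console.

```{r}
# Create two variables; both will appear in the environment pane
my_number <- 42
my_text <- "hello"

# ls() lists the names of the objects in the current environment
ls()

# rm() removes an object, so it disappears from the pane as well
rm(my_text)
exists("my_text")
```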
#### Files/plots/help
These panes are fairly self-explanatory. The **files** pane gives you a view of the files on your computer, the **plots** pane shows you any plots that you create, and the **help** pane can be used to look up functions or topics.
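You can also open help pages from the console rather than clicking around. This is just a small sketch using `mean()` and "linear model" as arbitrary examples; in RStudio the results should appear in the help pane.

```{r, eval = FALSE}
# Open the help page for a specific function
?mean
help("mean")

# Search the help pages of your installed packages for a topic
help.search("linear model")
??"linear model"
```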
## Packages
As we know, R is a functional programming language, meaning that we rely on functions to do our work for us. When you install R, you'll have access to thousands of functions that come bundled with it. However, these functions have been chosen because they're general-purpose and basic. Bundling functions for everything that could possibly be done in R into the base version would make it unnecessarily large.
So instead of every function being available from the start, people create sets of functions that are usually aimed at a particular task and then distribute them as a *package*. You can then install this package and have access to all these great functions that someone has written for you.
Some great examples of R packages are:
* [ggplot2](https://ggplot2.tidyverse.org/) for creating plots
* [dplyr](https://dplyr.tidyverse.org/) for data manipulation
* [shiny](https://shiny.rstudio.com/) for creating web apps
The thing to remember is that R has a fantastic open source community and if you need to do something in R, somebody has probably written a package to help you out.
### Installing packages
You can think of installing a package as a bit like installing a program on your computer. You only need to install it once, but then you'll need to open it each time you use it.
Installing packages is really easy; you just use the `install.packages()` function:
```{r, eval = FALSE}
install.packages("ggplot2")
```
You can choose where the package installs by supplying a path to the `lib` parameter (e.g. `lib = "C:/me/desktop"`), but by default it will install it into your default library folder. You can find the path to this default library folder with the `.libPaths()` function.
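Here's a short sketch of those options together. The folder path is the same made-up example as above, so swap in a folder that actually exists on your machine.

```{r, eval = FALSE}
# Show the library folders R knows about; the first one is the default
.libPaths()

# Install into a specific folder instead (example path, not a real one)
install.packages("ggplot2", lib = "C:/me/desktop")

# A package installed outside the known library folders has to be
# loaded from that same folder using the lib.loc argument
library(ggplot2, lib.loc = "C:/me/desktop")
```

Unless you have a specific reason to do otherwise, sticking with the default library folder keeps things simpler.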
Once the package has been installed, you only need to reinstall it if there's a problem or you want to update it. Otherwise, you just need to load it every time you want to use it. The logic behind this is that you may have hundreds of packages installed and you don't need all of them for every project you do. So instead, we load specific packages we want each time.
To load a package, use the `library()` function. Place this somewhere near the start of your script so that it's obvious which packages someone will need if they're reading your code and want to run it for themselves. This will then load in all of the functions from that package for you to use.
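For example, a script that makes a plot might start like the sketch below. It assumes `ggplot2` is already installed and uses the `mtcars` dataset that comes with R.

```{r, eval = FALSE}
# Load the packages this script relies on, right at the top
library(ggplot2)

# Functions from ggplot2 can now be used by name
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
```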
But Adam, what happens if I load two packages that have functions with the same name? Ah, that's a great question. Later in the book ([Environments](#environments)) we're going to look at exactly how R deals with this issue, but I'll give a simple explanation for now. When R loads your packages, it will do it in a specific order. The later a package is loaded, the higher its precedence. That means that when you try to use a function with a name that's in more than one of the packages you've loaded, it will default to the one from the package you loaded most recently.
To help avoid these situations, there are some things in place. Firstly, when you load a package that includes function names that are already used, you'll get a message telling you which functions have been masked. Secondly, to avoid this confusion altogether, you can be explicit about which package your function came from. To do that, just prefix your function with the package name and `::` when you use it, like this:
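If you want to peek at this ordering yourself, the sketch below uses only functions that come with R. I've picked `filter` as the example name because it exists in the built-in `stats` package and is also a common source of clashes; what you see will depend on which packages you have loaded.

```{r}
# search() lists what is currently attached; entries nearer the
# start of the list take precedence over those further along
search()

# find() reports every attached location that defines a given name
find("filter")

# environment() shows where the version R would actually pick lives
environment(filter)
```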
```{r, eval = FALSE}
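# Explicit: always uses mutate() from the dplyr package
dplyr::mutate()
# Implicit: uses whichever loaded package currently takes precedence
mutate()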
```
Finally, there is a package called `conflicted` that you can use to avoid these issues. Rather than giving a certain package precedence because it was loaded later, attempting to use a function that could come from more than one package will cause an error and you'll need to be explicit with which one you mean.
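As a rough sketch of how that looks, assuming both `conflicted` and `dplyr` are installed:

```{r, eval = FALSE}
library(conflicted)
library(dplyr)

# filter() exists in both dplyr and stats, so this now throws an error
# asking you to say which one you mean
filter(mtcars, cyl == 4)

# Either be explicit each time...
dplyr::filter(mtcars, cyl == 4)

# ...or declare a preference once for the rest of the session
conflict_prefer("filter", "dplyr")
filter(mtcars, cyl == 4)
```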
Personally, I use the prefix method. Yes, it's a little bit more verbose, but it makes it much easier for someone else to know exactly where your function has come from without the need for extra packages. It also helps future you, because there's nothing worse than coming back to a script a year later and forgetting which packages you need because you didn't include the correct library calls.