title | date
---|---
R intro | 31 December 2018
R: To use R, navigate your browser to cran.r-project.org.^[CRAN stands for Comprehensive R Archive Network. It is the central repository for downloading R itself and (vetted) packages.] Download and install the version for your operating system, and you're ready to go.
RStudio: Most R users interact with R through an (amazing) IDE called "RStudio". Navigate to https://www.rstudio.com/products/rstudio/ and download the desktop IDE. Now you're really ready.
Relative to Stata, R introduces a few new dimensions:
- R is free.
- R is an object-oriented language, in which objects have types.
- R uses packages (a.k.a. libraries).
- R tries to guess what you meant.
- R easily (and infinitely) parallelizes.
- R makes it easy to work with matrices.
- R plays nicely with Markdown.
Let's review these differences in more depth.
Free to use. Free to update and upgrade. Free to disseminate. Free for your students to install on their own laptops.
You know the old joke about an economist being told there is a $100 bill lying on the sidewalk? ("Impossible! Someone would have picked it up already.") Now think about the crazy license fees for proprietary econometrics and modelling software. You can see where this is going...
You might have heard or read something along the lines of: "In R, everything has a name and everything is an object". This probably sounds very abstract if you're coming from a language like Stata. However, the key practical implications of this so-called object-oriented (OO) approach are as follows:
- You hold many objects in memory at the same time.
- This could include multiple data frames, scalars, lists, functions, etc. (Remember: everything in R is an object.)
- One of the upshots is no more "preserve", "snapshot", "restore" Stata-esque hackery if you want to summarise some variables in your dataset, or have multiple datasets that you want to work on at the same time.
- As a corollary of this, defining or naming objects is a thing:
  - a <- 3 (i.e. the object a has been assigned as a scalar, or single-length vector, equal to 3)
  - b <- matrix(1:4, nrow = 2) (i.e. the object b has been assigned as a 2x2 matrix)
  - Side note: the <- assignment operator is read aloud as "gets". You can also use a regular old equal sign if you prefer, e.g. a = 3.
- Object types matter: e.g., a matrix is a bit different from a data.frame or a vector.
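To make the points above concrete, here is a minimal sketch of several objects living in memory at once. The data frames df1 and df2 and their values are made up for illustration; they are not part of the original examples.

```r
# Several objects of different types can coexist in one session.
a <- 3                       # a length-one numeric vector
b <- matrix(1:4, nrow = 2)   # a 2x2 integer matrix

# Two hypothetical data frames held in memory at the same time
df1 <- data.frame(wage = c(15, 22, 30))
df2 <- data.frame(wage = c(18, 25))

# Check what kind of object each one is
is.numeric(a)        # TRUE
is.matrix(b)         # TRUE
is.data.frame(df1)   # TRUE

# Summarise one data frame without disturbing the other
mean(df1$wage)
mean(df2$wage)       # 21.5
```

Nothing needs to be preserved or restored here: df1 and df2 simply sit side by side until you remove them.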
All of this might sound simple -- and it is! -- but one aspect of the OO approach that can trip up new R users (especially those coming from Stata) is that you have to be specific about which object you are referring to.
- In Stata, because there is only ever one dataset in memory, there can be no ambiguity about which variable you are referring to (or, more correctly, where Stata should look for it).
- However, because you can have multiple data frames in memory in R, you typically have to tell it that you want the variable "wage" from, say, dataframe1 and not from dataframe2.
- There are various ways to do this and it soon becomes second nature.
  - E.g. You could use the $ index operator: dataframe1$wage.
  - E.g. Some functions let you specify the data frame (or parent object) as part of the function call. We'll see some practical examples of this approach in the next section on regression models.
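The ways of pointing R at the right data frame can be sketched roughly as follows. The contents of dataframe1 and dataframe2 are invented for illustration, and with() is an extra base R option that the text above does not mention.

```r
# Two hypothetical data frames that both contain a variable called "wage"
dataframe1 <- data.frame(wage = c(10, 12, 17), education = c(12, 14, 16))
dataframe2 <- data.frame(wage = c(20, 25))

# Option 1: the $ operator names the data frame explicitly
mean(dataframe1$wage)   # 13
mean(dataframe2$wage)   # 22.5

# Option 2: pass the data frame to the function via its 'data' argument
fit <- lm(wage ~ education, data = dataframe1)

# Option 3 (not covered above): with() evaluates an expression inside a data frame
with(dataframe1, mean(wage))   # 13
```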
- Just as LaTeX uses packages (i.e., \usepackage{foo}), R also draws upon non-default packages (i.e., library(foo)).
- Note that R automatically loads with a set of default packages called the base installation, which includes the most commonly used packages and functions across all use cases (core probability and statistical operations, linear regression functions, etc.).
- However, to really become effective in R, you will need to install and use non-default packages too.
  - Seriously, R intends for you to make use of outside packages. Don't constrain yourself.^[If you want to get really meta: the pacman package helps you... manage packages.]
Install a package: install.packages("package.name")
- Notice that the installed package's name is in quotes.^[R uses both single ('word') and double quotes ("word") to reference characters (strings).]
- You generally only need to install a package once. That is, assuming you use the update.packages() command to update all of your installed packages at once (see below).
Load a package: library(package.name)
- Notice that you don't need quotation marks now (although library("package.name") also works). Reason: library() accepts the package's name directly, as if it were an object name rather than a character string.
- You will need to load any non-base package that you want to use at the start of a new R session.
Update packages: update.packages(ask=FALSE)
- This command will update all of your installed packages simultaneously. If you want to only update a specific package, you should simply reinstall it (i.e. install.packages("package.name")).
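The install and load steps above are often combined into an "install only if missing" pattern at the top of a script. The following is one possible sketch, not a prescribed workflow; the vector name needed is hypothetical, and the two packages listed ship with R so that the example runs without downloading anything.

```r
# Packages this hypothetical script needs. "stats" and "utils" ship with R,
# so nothing is actually installed when this runs as written.
needed <- c("stats", "utils")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Install only what is missing (this step needs an internet connection)
if (any(!installed)) {
  install.packages(needed[!installed])
}

# Attach each package; character.only = TRUE lets library() take a string
for (pkg in needed) {
  library(pkg, character.only = TRUE)
}
```

In your own scripts you would replace the contents of needed with the non-base packages you actually use.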
If you don't feel like typing in these commands manually, one of the many advantages of the RStudio IDE is that it makes installing and updating packages very easy (autocompletion, package search, etc.). Just click on the "Packages" tab of the bottom-right panel.
R is friendly and tries to help if you weren't specific enough. Consider the following hypothetical OLS regression, where lm() is just the workhorse function for linear models in R:
lm(wage ~ education + gender, data = dataframe1)
Here, we could use a string variable like gender (which takes values like "female" and "male") directly in our regression call. R knows what you mean: you want indicator variables for the levels of the variable.^[Variables in R that have different qualitative levels are known as "factors". Behind the scenes, R is converting gender from a string to a factor for you, although you can also do this explicitly yourself.]
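As a small runnable illustration of this behaviour, the sketch below fits the regression on made-up data. The values in dataframe1 are invented purely so that the call can run.

```r
# Hypothetical data: 'gender' is stored as plain character strings
dataframe1 <- data.frame(
  wage      = c(12, 15, 14, 20, 22, 19, 30, 27),
  education = c(10, 12, 12, 14, 16, 16, 18, 18),
  gender    = c("female", "male", "female", "male",
                "female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# lm() converts the character variable to a factor behind the scenes
fit <- lm(wage ~ education + gender, data = dataframe1)
names(coef(fit))
## [1] "(Intercept)" "education"   "gendermale"

# The same conversion done explicitly
dataframe1$gender <- factor(dataframe1$gender)
levels(dataframe1$gender)
## [1] "female" "male"
```

The single gendermale coefficient is the indicator for "male"; "female" comes first alphabetically and so serves as the reference level.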
Mostly, this is a good thing, but sometimes R's desire to help can hide programming mistakes and idiosyncrasies. So it's best to be aware, e.g.:
TRUE + TRUE
## [1] 2
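A few more cases of the same silent type coercion, sketched here as illustrations to be aware of rather than an exhaustive list:

```r
# Logical values are silently coerced to numbers in arithmetic
sum(c(TRUE, FALSE, TRUE))   # 2, i.e. a count of the TRUEs

# Mixing types in a vector coerces everything to the most flexible type
x <- c(1, "2", 3)
class(x)                    # "character"

# Comparing a number with a string compares both as strings
1 == "1"                    # TRUE
10 < "9"                    # TRUE, because "10" sorts before "9" as text
```

The last line is the kind of result that can hide a mistake, for example when a numeric column has accidentally been read in as text.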
Parallelization in R is easily done thanks to various packages like parallel, pbapply, future, and foreach.
Let's illustrate by way of a simulation. First we'll create some data (our_data) and a function (our_reg), which draws a sample of 10,000 observations and runs a regression.
# Set our seed
set.seed(12345)
# Set sample size
n <- 1e6
# Generate 'x' and 'e'
our_data <- data.frame(x = rnorm(n), e = rnorm(n))
# Calculate 'y'
our_data$y <- 3 + 2 * our_data$x + our_data$e
# Function that draws a sample of 10,000 observations and runs a regression
our_reg <- function(i) {
  # Sample the data
  sample_data <- our_data[sample.int(n = n, size = 1e4, replace = TRUE), ]
  # Run the regression and return the slope coefficient on 'x'
  coef(lm(y ~ x, data = sample_data))[2]
}
With our data and function created, let's run the simulation without parallelization:
library(tictoc) ## For convenient timing
set.seed(1234) ## Optional. (Ensures results are exactly the same.)
tic()
# 10,000-iteration simulation
sim1 <- lapply(X = 1:1e4, FUN = our_reg)
toc()
## 73.576 sec elapsed
Now run the simulation with parallelization (12 cores):
library(pbapply) ## Adds progress bar and parallel options
set.seed(1234) ## Optional. (Ensures results are exactly the same.)
tic()
# 10,000-iteration simulation
sim2 <- pblapply(X = 1:1e4, FUN = our_reg, cl = 12)
toc()
## 18.125 sec elapsed
Not only was this about four times faster^[It's not a full 12 times faster because of the overhead needed to run this code in parallel (among other things). Since this overhead is largely a sunk cost, the relative speed-up will improve as we increase the number of iterations.], but notice how little the syntax changed to run the parallel version. To highlight the differences in bold: pblapply(X = 1:1e4, FUN = our_reg**, cl = 12**).
Here's another parallel option just to drive home the point. (In R, there are almost always multiple ways to get a particular job done.)
library(future.apply) ## Another option.
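plan(multisession) ## 'multiprocess' is deprecated in recent versions of future
set.seed(1234) ## Optional. (Ensures results are exactly the same.)
tic()
# 10,000-iteration simulation
## future.seed = TRUE requests parallel-safe random numbers
sim3 <- future_lapply(X = 1:1e4, FUN = our_reg, future.seed = TRUE)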
toc()
## 17.942 sec elapsed
Further, many packages in R default (or have options) to work in parallel. E.g., the regression package lfe uses the available processing power to estimate fixed-effect models.
Again, all of this extra parallelization functionality comes for free. In contrast, have you looked up the cost of a Stata/MP license recently? (Never mind that you effectively pay per core!)
Note: This parallelization often means that you move away from for loops and toward parallelized replacements (e.g., lapply has many parallelized implementations).^[Though there are parallelized for loop versions.]
Because R began its life as a statistical language/environment, it plays very nicely with matrices.
Create a matrix:
## The "c()" stands for "concatenate" and is used to bind together a sequence of numbers or strings.
matrix(data = c(3, 2, 3, 5, 9, 4, 3, 2, 7), ncol = 3)
##      [,1] [,2] [,3]
## [1,]    3    5    3
## [2,]    2    9    2
## [3,]    3    4    7
Assign (store) a matrix:
A <- matrix(data = c(3, 2, 3, 5, 9, 4, 3, 2, 7), ncol = 3)
Invert a matrix:
solve(A)
##            [,1]        [,2]  [,3]
## [1,]  0.8088235 -0.33823529 -0.25
## [2,] -0.1176471  0.17647059  0.00
## [3,] -0.2794118  0.04411765  0.25
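A few other everyday matrix operations, sketched with the same matrix A. The vector v is made up for illustration.

```r
A <- matrix(data = c(3, 2, 3, 5, 9, 4, 3, 2, 7), ncol = 3)

t(A)                                 # transpose
A %*% solve(A)                       # matrix product; the 3x3 identity up to rounding
all.equal(A %*% solve(A), diag(3))   # TRUE
det(A)                               # determinant (68 for this matrix)

# Solve the linear system A x = v directly, without forming the inverse
v <- c(1, 2, 3)
x <- solve(A, v)
all.equal(as.vector(A %*% x), v)     # TRUE
```

Note that %*% is matrix multiplication, whereas a plain * multiplies element by element.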
Notebooks, websites, presentations, etc. can all easily include:
code chunks,
# Some amazing code
a = 2
b = 10
a^(b/a)
## [1] 32
evaluated code,
"Ernie" > "Burt"
## [1] TRUE
normal or mathematical text,
and even interactive content like leaflet maps.
library(leaflet)
leaflet() %>%
  addTiles() %>% # Add default OpenStreetMap map tiles
  addMarkers(lng = -123.075, lat = 44.045, popup = "The University of Oregon")
Yes, Stata 15 has some Markdown support, but the difference in functionality is pretty stark.
Now that you (hopefully) have a better sense of R, let's head over to the regression intro section to try some hands-on examples.