-
Notifications
You must be signed in to change notification settings - Fork 118
/
chapter-abstracts_Advanced_R_Solutions.Rmd
137 lines (84 loc) · 24.9 KB
/
chapter-abstracts_Advanced_R_Solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
# Meta data (for Advanced R Solutions)
# Template:
# * Start with in chapter <no> we discuss <take sth from the intro>.
# * <Then use a bit more from the intro section of a chapter, e.g. why it's important. What's the goal>
# * Then use the subsection names and the stuff from the "Outline" to sequentially refer to what is done in the chapter
# * 150-200 words per chapter
# * Don't use R code within the abstracts
## 2. Names and values
The second chapter explains the distinction between an object and its name. This distinction is important, as it helps:
* More accurately predicting the performance and memory usage of R code.
* Writeing faster code by avoiding accidental copies, a major source of slow code.
* Better understanding R’s functional programming tools.
In section "Binding basics", we introduce the distinction between names and values, and how bindings, or references, between a name and a value are created. Then, in "Copy-on-modify", we learn when R makes a copy and how to use tracemem to figure out when a copy actually occurs. We also learn about the implications for different R objects. In "Object size" we explore how much memory an object occupies. "Modify-in-place" describes two exceptions to copy-on-modify: environments and values with a single name are actually modified in place. Lastly, we conclude this chapter with a discussion of the garbage collector, in "Unbinding and the garbage collector".
## 3. Vectors
The third chapter discusses the most important family of data types in base R: vectors. It won’t cover individual vector types in too much detail, but it will show how all the types fit together as a whole.
Vectors come in two flavours: atomic vectors and lists. We start with "Atomic vectors", which are R’s simplest data structures. For atomic vectors, all elements must have the same type. Then, we take a small detour to discuss "Attributes", R’s flexible metadata specification. "S3 atomic vectors" discusses the important vector types that are built by combining atomic vectors with special attributes. These include factors, dates, date-times, and durations. Next, we dive into "Lists". Lists are very similar to atomic vectors. However, an element of a list can be any data type, including another list. This makes them suitable for representing hierarchical data. "Data frames and tibbles" teaches you about data frames and tibbles, which are used to represent rectangular data. They combine the behaviour of lists and matrices to make a structure ideally suited for the needs of statistical data. To finish up this chapter, we briefly talk about one final important data structure that’s closely related to vectors: "NULL".
## 4. Subsetting
In chapter 4 we discuss R's fast and powerful subsetting operators. Mastering them allows you to succinctly perform complex operations in a way that few other languages can match. Subsetting in R is easy to learn but hard to master because you need to internalise a number of interrelated concepts.
"Selecting multiple elements" starts by teaching you about [. You’ll learn the six ways to subset atomic vectors. You’ll then learn how those six ways act when used to subset lists, matrices, and data frames. "Selecting a single element" expands your knowledge of subsetting operators to include [[ and $ and focuses on the important principles of simplifying versus preserving. "Applications" leads you through eight important, but not obvious, applications of subsetting to solve problems that you often encounter in data analysis.
## 5. Control Flow
In chapter 5 you'll learn about the two primary tools of control flow: choices and loops. Choices, like if statements and switch() calls, allow you to run different code depending on the input. Loops, like for and while, allow you to repeatedly run code, typically with changing options.
We’d expect that you’re already familiar with the basics of these functions so we’ll briefly cover some technical details and then introduce some useful, but lesser known, features. The condition system (messages, warnings, and errors), which you’ll learn about in Chapter 8, also provides non-local control flow.
In "Choices" we dive into the details of the if statement, and then discuss the close relatives ifelse() and switch(). "Loops" starts off by reminding you of the basic structure of the for loop in R, discusses some common pitfalls, and then talks about the related while and repeat statements.
## 6. Functions
In chapter 6, you’ll learn how to turn your informal, working knowledge about functions into more rigorous, theoretical understanding. The tricks and techniques you will learn along the way will be important for understanding the more advanced topics discussed later in the book.
"Function fundamentals" describes the basics of creating a function, the three main components of a function, and the exception to many function rules: primitive functions (which are implemented in C, not R). "Lexical scoping" shows you how R finds the value associated with a given name, i.e. the rules of lexical scoping. "Lazy evaluation" is devoted to an important property of function arguments: they are only evaluated when used for the first time. "... (dot-dot-dot)" discusses the special ... argument, which allows you to pass on extra arguments to another function. "Exiting a function" discusses the two primary ways that a function can exit, and how to define an exit handler, code that is run on exit, regardless of what triggers it. "Function forms" shows you the various ways in which R disguises ordinary function calls, and how you can use the standard prefix form to better understand what’s going on.
## 7. Environments
In chapter 7 you will learn about environments, the data structure that powers scoping. This chapter dives deep into environments, describing their structure in depth, and using them to improve your understanding of the four scoping rules described in the “Lexical scoping” section of the functions chapter. Understanding environments is not necessary for day-to-day use of R. But they are important to understand because they power many important R features like lexical scoping, namespaces, and R6 classes, and interact with evaluation to give you powerful tools for making domain specific languages, like dplyr and ggplot2.
"Environment basics" introduces you to the basic properties of an environment and shows you how to create your own. "Recursing over environments" provides a function template for computing with environments, illustrating the idea with a useful function. "Special environments" describes environments used for special purposes: for packages, within functions, for namespaces, and for function execution. "Call stacks" explains the last important environment: the caller environment. This requires you to learn about the call stack, that describes how a function was called. You’ll have seen the call stack if you’ve ever called traceback() to aid debugging. “As data structures" briefly discusses three places where environments are useful data structures for solving other problems.
## 8. Conditions
In chapter 8 we will discuss R’s condition system, which provides a paired set of tools that allow the author of a function to indicate that something unusual is happening, and the user of that function to deal with it. The function author signals conditions with functions like stop(), warning(), and message(), then the function user can handle them with functions like tryCatch() and withCallingHandlers(). Understanding the condition system is important because you’ll often need to play both roles: signalling conditions from the functions you create, and handle conditions signalled by the functions you call.
R’s condition system is based on ideas from Common Lisp. Like R’s approach to object-oriented programming, it is rather different to currently popular programming languages so it is easy to misunderstand, and there has been relatively little written about how to use it effectively. Historically, this has meant that few people have taken full advantage of its power. The goal of this chapter is to remedy that situation. Here you will learn about the big ideas of R’s condition system, as well as learning a bunch of practical tools that will make your code stronger.
“Signalling conditions” introduces the basic tools for signalling conditions, and discusses when it is appropriate to use each type. “Handling conditions” introduces the condition object, and the two fundamental tools of condition handling: tryCatch() for error conditions, and withCallingHandlers() for everything else. “Custom conditions” shows you how to extend the built-in condition objects to store useful data that condition handlers can use to make more informed decisions. “Applications” closes out the chapter with a grab bag of practical applications based on the low-level tools found in earlier sections.
## 9. Functionals
In chapter 9 we'll learn about functionals. A functional is a function that takes a function as an input and returns a vector as output. A common use of functionals is as an alternative to for loops. As Loops do not convey what should be done with the results, it’s better to use a functional. Each functional is tailored for a specific task, so when you recognise the functional you immediately know why it’s being used.
"My first functional: map()" introduces your first functional: purrr::map(). "Map variants" teaches you about 18 (!!) important variants of purrr::map(). Fortunately, their orthogonal design makes them easy to learn, remember, and master. "Predicate functionals" teaches you about predicates: functions that return a single TRUE or FALSE, and the family of functionals that use them to solve common problems. "Base functionals" reviews some functionals in base R that are not members of the map, reduce, or predicate families.
## 10. Function factories
In chapter 10 we’ll discuss function factories. A function factory is a function that makes functions.
Of the three main functional programming tools (functionals, function factories, and function operators), function factories are the least used. Generally, they don’t tend to reduce overall code complexity but instead partition complexity into more easily digested chunks. Function factories are also an important building block for the very useful function operators, which you’ll learn about in Chapter 11 (“Function operators”).
"Factory fundamentals" begins the chapter with an explanation of how function factories work, pulling together ideas from scoping and environments. You’ll also see how function factories can be used to implement a memory for functions, allowing data to persist across calls. "Graphical factories" illustrates the use of function factories with examples from ggplot2. You’ll see two examples of how ggplot2 works with user supplied function factories, and one example of where ggplot2 uses a function factory internally. "Statistical factories" uses function factories to tackle three challenges from statistics: understanding the Box-Cox transform, solving maximum likelihood problems, and drawing bootstrap resamples. "Function factories + functionals" shows how you can combine function factories and functionals to rapidly generate a family of functions from data.
## 11. Function operators
In chapter 11, you’ll learn about function operators. A function operator is a function that takes one (or more) functions as input and returns a function as output.
Function operators are closely related to function factories; indeed they’re just a function factory that takes a function as input. Like factories, there’s nothing you can’t do without them, but they often allow you to factor out complexity in order to make your code more readable and reusable. Function operators are typically paired with functionals. If you’re using a for-loop, there’s rarely a reason to use a function operator, as it will make your code more complex for little gain. If you’re familiar with Python, decorators is just another name for function operators.
"Existing function operators" introduces you to two extremely useful existing function operators, and shows you how to use them to solve real problems. "Case study: Creating your own function operators" works through a problem amenable to solution with function operators: downloading many web pages.
## 13. S3
In chapter 13 we’ll learn about R’s first and simplest OO system: S3. S3 is informal and ad hoc, but there is a certain elegance in its minimalism: you can’t take away any part of it and still have a useful OO system. For these reasons, you should use it, unless you have a compelling reason to do otherwise. S3 is the only OO system used in the base and stats packages, and it’s the most commonly used system in CRAN packages.
"Basics" gives a rapid overview of all the main components of S3: classes, generics, and methods. You’ll also learn about sloop::s3_dispatch(), which we’ll use throughout the chapter to explore how S3 works. Then, "Classes" goes into the details of creating a new S3 class, including the three functions that should accompany most classes: a constructor, a helper, and a validator. Further, "Generics and methods" describes how S3 generics and methods work, including the basics of method dispatch. "Object styles" discusses the four main styles of S3 objects: vector, record, data frame, and scalar. "Inheritance" demonstrates how inheritance works in S3, and shows you what you need to make a class “subclassable”. "Dispatch details" concludes the chapter with a discussion of the finer details of method dispatch including base types, internal generics, group generics, and double dispatch.
## 14. R6
Chapter 14 describes the R6 OOP system which is more like OOP in other languages. It uses the encapsulated OOP paradigm, which means that methods belong to objects, not generics, and you call them like object$method(). Also, R6 objects are mutable, which means that they are modified in place, and hence have reference semantics.
If you’ve learned OOP in another programming language, it’s likely that R6 will feel very natural, and you’ll be inclined to prefer it over S3. Resist the temptation to follow the path of least resistance: in most cases R6 will lead you to non-idiomatic R code.
"Classes and methods" introduces R6::R6Class(), the one function that you need to know to create R6 classes. You’ll learn about the constructor method, $new(), which allows you to create R6 objects, as well as other important methods like $initialize() and $print(). "Controlling access" discusses the access mechanisms of R6: private and active fields. Together, these allow you to hide data from the user, or expose private data for reading but not writing. "Reference semantics" explores the consequences of R6’s reference semantics. You’ll learn about the use of finalizers to automatically clean up any operations performed in the initializer, and a common gotcha if you use an R6 object as a field in another R6 object. "Why R6?" describes why we cover R6, rather than the base RC system.
## 15. S4
In chapter 15 we’ll learn about the S4 OO system, which provides a formal approach to functional OOP. The underlying ideas are similar to S3, but the implementation is much stricter and makes use of specialised functions for creating classes (setClass()), generics (setGeneric()), and methods (setMethod()). Additionally, S4 provides both multiple inheritance (i.e. a class can have multiple parents) and multiple dispatch (i.e. method dispatch can use the class of multiple arguments).
An important new component of S4 is the slot, a named component of the object that is accessed using the specialised subsetting operator @ (pronounced at). The set of slots, and their classes, forms an important part of the definition of an S4 class.
“Basics” gives a quick overview of the main components of S4: classes, generics, and methods. “Classes” dives into the details of S4 classes, including prototypes, constructors, helpers, and validators. “Generics and methods” shows you how to create new S4 generics, and how to supply those generics with methods. You’ll also learn about accessor functions which are designed to allow users to safely inspect and modify object slots. “Method dispatch” dives into the full details of method dispatch in S4. The basic idea is simple, then it rapidly gets more complex once multiple inheritance and multiple dispatch are combined. “S4 and S3” discusses the interaction between S4 and S3, showing you how to use them together.
## 18. Expressions
The focus of chapter 18 chapter are the data structures that underlie expressions. Mastering this knowledge will allow you to inspect and modify captured code, and to generate code with code. We’ll come back to expr() in Chapter 19 (“Quasiquotation”), and to eval() in Chapter 20 (“Evaluation”).
“Abstract syntax trees” introduces the idea of the abstract syntax tree (AST), and reveals the tree like structure that underlies all R code. “Expressions” dives into the details of the data structures that underpin the AST: constants, symbols, and calls, which are collectively known as expressions. “Parsing and grammar” covers parsing, the act of converting the linear sequence of character in code into the AST, and uses that idea to explore some details of R’s grammar. “Walking AST with recursive functions” shows you how you can use recursive functions to compute on the language, writing functions that compute with expressions. “Specialised data structures” circles back to three more specialised data structures: pairlists, missing arguments, and expression vectors.
## 19. Quasiquotation
In chapter 19 you will learn about quasiquotation. Quasiquotation is one of the three pillars of tidy evaluation. You’ll learn about the other two (quosures and the data mask) in Chapter 20 (“Evaluation”). When used alone, quasiquotation is most useful for programming, particularly for generating code. But when it’s combined with the other techniques, tidy evaluation becomes a powerful tool for data analysis.
“Motivation” motivates the development of quasiquotation with a function, cement(), that works like paste() but automatically quotes its arguments so that you don’t have to. “Quoting” gives you the tools to quote expressions, whether they come from you or the user, or whether you use rlang or base R tools. “Unquoting” introduces the biggest difference between rlang quoting functions and base quoting function: unquoting with !! and !!!. “... (dot-dot-dot)” explores another place that you can use !!!, functions that take .... It also introduces the special := operator, which allows you to dynamically change argument names. “Case studies” shows a few practical uses of quoting to solve problems that naturally require some code generation.
## 20. Evaluation
Chapter 20 discusses the topic evaluation. The user-facing inverse of quotation is unquotation: it gives the user the ability to selectively evaluate parts of an otherwise quoted argument. The developer-facing complement of quotation is evaluation: this gives the developer the ability to evaluate quoted expressions in custom environments to achieve specific goals.
“Evaluation basics” discusses the basics of evaluation using eval(), and shows how you can use it to implement key functions like local() and source(). “Quosures” introduces a new data structure, the quosure, which combines an expression with an environment. You’ll learn how to capture quosures from promises, and evaluate them using rlang::eval_tidy(). “Data masks” extends evaluation with the data mask, which makes it trivial to intermingle symbols bound in an environment with variables found in a data frame. “Using tidy evaluation” shows how to use tidy evaluation in practice, focussing on the common pattern of quoting and unquoting, and how to handle ambiguity with pronouns. “Base evaluation” circles back to evaluation in base R, discusses some of the downsides, and shows how to use quasiquotation and evaluation to wrap functions that use NSE.
## 21. Translating R code
Chapter 20 discusses the creation of domain specific languages. The combination of first-class environments, lexical scoping, and metaprogramming gives us a powerful toolkit for translating R code into other languages. One fully-fledged example of this idea is dbplyr, which powers the database backends for dplyr, allowing you to express data manipulation in R and automatically translate it into SQL. You can see the key idea in translate_sql() which takes R code and returns the equivalent SQL.
Translating R to SQL is complex because of the many idiosyncrasies of SQL dialects, so here I’ll develop two simple, but useful, domain specific languages (DSL): one to generate HTML, and the other to generate mathematical equations in LaTeX.
In “HTML” we create a DSL for generating HTML, using quasiquotation and purrr to generate a function for each HTML tag, then tidy evaluation to easily access them. In “LaTeX” we transform mathematically R code into its LaTeX equivalent using a combination of tidy evaluation and expression walking.
## 23. Measuring performance
In chapter 23 we learn about measuring performance.
Before you can make your code faster, you first need to figure out what’s making it slow. This sounds easy, but it’s not. Even experienced programmers have a hard time identifying bottlenecks in their code. So instead of relying on your intuition, you should profile your code: measure the run-time of each line of code using realistic inputs.
Once you’ve identified bottlenecks you’ll need to carefully experiment with alternatives to find faster code that is still equivalent. In Chapter 24 (“Improving performance”) you’ll learn a bunch of ways to speed up code, but first you need to learn how to microbenchmark so that you can precisely measure the difference in performance.
“Profiling” shows you how to use profiling tools to dig into exactly what is making code slow. Section “Microbenchmarking” shows how to use microbenchmarking to explore alternative implementations and figure out exactly which one is fastest.
## 24. Improving performance
Chapter 24 teaches how to improve performance. Once you’ve used profiling to identify a bottleneck, you need to make it faster. It’s difficult to provide general advice on improving performance, but in this chapter we’ll see four techniques that can be applied in many situations and also a general strategy for performance optimisation that helps ensure that your faster code is still correct. Also it is easy to get caught up in trying to remove all bottlenecks. Don’t! Be pragmatic: don’t spend hours of your time to save seconds of computer time.
“Checking for existing solutions” reminds you to look for existing solutions. “Doing as little as possible” emphasises the importance of being lazy: often the easiest way to make a function faster is to let it to do less work. “Vectorise” concisely defines vectorisation, and shows you how to make the most of built-in functions. “Avoiding copies” discusses the performance perils of copying data. “Case study: t-test” pulls all the pieces together into a case study showing how to speed up repeated t-tests by about a thousand times. “Other techniques” finishes the chapter with pointers to more resources that will help you write fast code.
## 25. Rewriting R code in C++
Chapter 25 shows how to improve the performance by rewriting R code in C++ via the Rcpp package.
Rcpp makes it very simple to connect C++ to R. While it is possible to write C or Fortran code for use in R, it will be painful by comparison. Rcpp provides a clean, approachable API that lets you write high-performance code, insulated from R’s complex C API.
Typical bottlenecks that C++ can address include:
* Loops that can’t be easily vectorised because subsequent iterations depend on previous ones.
* Recursive functions, or problems which involve calling functions millions of times. The overhead of calling a function in C++ is much lower than in R.
* Problems that require advanced data structures and algorithms that R doesn’t provide. Through the standard template library (STL), C++ has efficient implementations of many important data structures, from ordered maps to double-ended queues.
“Getting started with C++” teaches you how to write C++ by converting simple R functions to their C++ equivalents. You’ll learn how C++ differs from R, and what the key scalar, vector, and matrix classes are called. “Missing values” teaches you how to work with R’s missing values in C++. “Standard Template Library” shows you how to use some of the most important data structures and algorithms from the standard template library, or STL, built-in to C++. “Case studies” shows two real case studies where Rcpp was used to get considerable performance improvements. “Using Rcpp in a package” teaches you how to add C++ code to a package. “Learning more” concludes the chapter with pointers to more resources to help you learn Rcpp and C++.