# The Model-Fitting Paradigm in R
## Today
1. Recap from last class on principles of effective visualizations
1. Introduction and motivation for model-fitting in R
1. Instructor and TA evaluations
1. Worksheet where we do a full model-fitting analysis together
1. Deep thoughts on data analytic work (go through material at home)
Worksheet: the model-fitting paradigm in R.
- [Rmd](https://github.com/STAT545-UBC/Classroom/blob/master/tutorials/cm014-exercise.Rmd),
- [from html](http://stat545.com/Classroom/notes/cm014-exercise.nb.html)
<!-- Note: not able to add parsnip or rsample this year, did briefly mention it though
Possible addition for 2020: `parsnip` package. Maybe `rsample` if it didn't make it into cm011.
-->
## Recap: principles of effective visualizations
Here are some general principles for effective figures:

1. Apply the [principle of proportional ink](https://callingbullshit.org/tools/tools_proportional_ink.html)
    - Definition: "The amount of ink used to indicate a value should be proportional to the value itself."
    - Example: Truncating the y-axis on a bar chart to exaggerate the difference between bars violates the principle of proportional ink.
1. Maintain a high data-to-ink ratio: [less is more](https://speakerdeck.com/cherdarchuk/remove-to-improve-the-data-ink-ratio)
    - Definition: remove distracting visual elements to focus attention on the data
    - Examples: Lighten line weights, remove backgrounds, never use 3D or special effects, remove unnecessary/redundant labels, etc.
1. Always update axis labels and titles on your plots
    - In STAT545/547 we take principles of effective visualizations very seriously and you will lose marks if they aren't followed
1. Choose your scale-type carefully
    - Whether you choose a linear, logarithmic, or square-root scale depends on your data, context, and purpose
1. Choose your graph-type carefully
    - Examples: [here](https://serialmentor.com/dataviz/directory-of-visualizations.html) is a great directory of plots
1. Choose colours with accessibility and readability in mind
    - Examples: [here](http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette) is a great set of colour schemes that are colour-blind friendly and perceptually uniform
## Resources
- [Tutorial](https://cfss.uchicago.edu/notes/linear-models/) on model fitting in R.
- [`broom` vignette](https://cran.r-project.org/web/packages/broom/vignettes/broom.html)
If you're interested in learning more about the actual statistical / machine learning methods for fitting models, I highly recommend the book ["An Introduction to Statistical Learning"](https://www-bcf.usc.edu/~gareth/ISL/) (freely available online). It uses R, too.
## Introduction and motivation for model-fitting in R.
We will go through this [tutorial](https://cfss.uchicago.edu/notes/linear-models/) from Dr. Benjamin Soltoff.
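
As a rough preview of the paradigm (my own minimal sketch, not taken from the tutorial), the example below fits a linear model, queries the model object, and predicts on new data. The built-in `mtcars` data and the formula `mpg ~ wt` are arbitrary choices for illustration, and the `broom` calls run only if that package happens to be installed.

```r
# Fit: a formula plus a data frame goes in, a model object comes out.
fit <- lm(mpg ~ wt, data = mtcars)

# Extract: query the model object with generic functions.
coefs <- coef(fit)                    # named vector: intercept and slope
r_squared <- summary(fit)$r.squared   # proportion of variance explained

# Predict: supply new data that uses the same predictor names.
new_cars <- data.frame(wt = c(2.5, 3.5))
preds <- predict(fit, newdata = new_cars)

# broom returns the same kind of information as tidy data frames.
if (requireNamespace("broom", quietly = TRUE)) {
  print(broom::tidy(fit))           # one row per coefficient
  print(broom::glance(fit))         # one row of model-level summaries
  print(head(broom::augment(fit)))  # model data plus fitted values and residuals
}
```

The same fit / extract / predict pattern carries over to most model-fitting functions in R, which is why it is worth learning once on a simple linear model.
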
## Instructor and TA evaluations
Since I was only your instructor for the last two weeks, I probably will not have a course evaluation on Canvas, but I'd still like your feedback. Please give it to me (anonymously) using [this brief survey](https://firasmoosvi.typeform.com/to/ihIYQe).
To fill out an evaluation for **Dr. Coia** (cm001 to cm010), go to the STAT545 Canvas course, and in the left sidebar, click "Course Evaluation".
TA evaluations are still done on paper so I will hand out the TA evaluation forms and a pencil if you need it. Your TAs for this course were: Alejandra, Victor, Hossam, and Yulia.
Let's take some time to fill these out. Feedback is very important to all of us.
## Break
## Worksheet
## Deep Thoughts about Data Analytic Work
- [Jenny's Deep Thoughts](https://www.slideshare.net/jenniferbryan5811/cm002-deep-thoughts)
("Deep Thoughts" refers to Jack Handey's skit on Saturday Night Live)
### Reproducibility
This practice is so important, I'm giving it its own section.
- Jenny Bryan: "If the thought of re-running your analysis makes you ill... you're not doing it right"
- Progress is not about stacking bricks. Don't ever be afraid to revisit "earlier" points in the analysis.
The concept goes beyond programming, and many argue as to its meaning, but we're taking it to mean two things as relevant to data analytic work:
__1\. How easily someone can reproduce output.__
Example: generating a plot; or a report.
- Worst case scenario: no source code is available; regenerate from scratch.
- Best case scenario: output is regenerated with the click of a button.
Concepts to live by:
- Source is real. Output is transient.
- Source >> Output.
- Working interactively? Frequently nuke your session and run from the top. It should still work.
- Don't save workspace upon closing.
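
One way to act on "source is real, output is transient" is to have code, not a mouse click, write every figure to disk. The sketch below is only an illustration: the simulated data, the seed value, and the file name `scatter.pdf` are all hypothetical.

```r
# Regenerate a figure entirely from source, so it survives a restarted session.
set.seed(545)   # fix the random numbers so that reruns give the same plot

n <- 100
x <- rnorm(n)
y <- x + rnorm(n)

# Written to a temporary directory here; a real project would use a project path.
plot_path <- file.path(tempdir(), "scatter.pdf")

pdf(plot_path)  # open a file-based graphics device
plot(x, y,
     xlab = "x (simulated)", ylab = "y (simulated)",
     main = "Simulated linear relationship")
invisible(dev.off())  # close the device to finish writing the file
```

If this lives in a script, running the script from a fresh R session should recreate the file without any manual steps.
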
Who is this "someone" I speak of?
- Future you
- Collaborators
- Critics
- Successors (your capstone partners)
__2\. How easily someone can reproduce _your_ frame of mind / conceptual framework at some point in time.__
Example: All the code files relating to your thesis.
- Worst case scenario: No explanations.
- Best case scenario: READMEs concisely summarize what's present; comments or markdown in reproducible reports outline big-picture logic.
Broader examples:
- Conclusions drawn from an analysis.
- Understanding of a concept in class.
- Even taking photos to capture moments in time.
Concepts to live by:
- Avoid technical debt, and document the big-picture structure both _within_ files (comments) and _between_ files (READMEs).
- An unsuspecting data scientist stumbles upon your work. Can they figure out what's going on?
### Coding Practices: Naming
- Some people use camelCase, snake_case, or.periods.
- Functions should be verbs (named for the one thing each function achieves). Objects should be nouns. Functions that return functions should be adverbs.
- Don't over-create. The more objects there are in your global environment, the more confusing it becomes -- especially if you don't name them by a consistent rule.
- Don't under-create: avoid magic numbers!
    - Bad:
        ```
        x <- rnorm(100)
        y <- x + rnorm(100)
        ```
    - Good:
        ```
        n <- 100
        x <- rnorm(n)
        y <- x + rnorm(n)
        ```
- Disambiguate from left-to-right, not right-to-left. Think tab completion.
    - Bad: `canada_gdp` and `china_gdp`.
    - Good: `gdp_canada` and `gdp_china`.
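
To make the verb / noun / adverb convention concrete, here is a small sketch. All of the names (`heights_cm`, `standardize`, `possibly_na`, `log_safely`) are hypothetical and chosen only for illustration.

```r
# Noun: an object holding data.
heights_cm <- c(150, 160, 170, 180)

# Verb: a function named after the one thing it does.
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Adverb: a function that takes a function and returns a modified version.
# Here the modified version returns NA instead of stopping on an error.
possibly_na <- function(f) {
  function(...) {
    tryCatch(f(...), error = function(e) NA)
  }
}

heights_std <- standardize(heights_cm)

log_safely <- possibly_na(log)
log_safely(10)      # behaves like log(10)
log_safely("ten")   # log() errors on character input, so this returns NA
```
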
### Coding Practices: Documenting code
Self-documenting code = code that speaks for itself.
- Write self-documenting code with meaningful variable names, and ideally using the tidyverse. Metaprogramming shines here, too.
    - Base R:
        ```
        mtcars[mtcars$cyl < 8, c("cyl", "mpg")]
        ```
    - tidyverse:
        ```
        library(dplyr)
        mtcars %>%
          filter(cyl < 8) %>%
          select(cyl, mpg)
        ```
Think before you comment your code! It's easy for comments and code to misalign at original writing OR as things shift around.
- Use comments to describe high-level decisions in the analysis.
    - Bad:
        ```
        # Lag the river discharge twice.
        ```
    - Good:
        ```
        # Create predictors.
        ```
- Don't use comments to describe what your code is doing at a low level.
    - Bad:
        ```
        # make data frame of cylinders less than 8, with variables 'cyl' and 'mpg'
        ```
    - Good: No comment at all.
    - Better: Use the tidyverse.
### Coding Practices: The DRY principle
DRY = don't repeat yourself. Instead,
- Write functions
- Write functions that do a very specific thing
Example of a function that does too much: one that normalizes a vector AND possibly computes the location and scale.
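
Here is a sketch of how that example might be split up. I'm assuming "location" and "scale" mean the mean and standard deviation, and the function names are hypothetical.

```r
# Does too much: estimates location and scale when they are missing,
# AND normalizes, so callers cannot easily tell which values were used.
normalize_or_estimate <- function(x, location = NULL, scale = NULL) {
  if (is.null(location)) location <- mean(x)
  if (is.null(scale)) scale <- sd(x)
  (x - location) / scale
}

# Split version: each function does one specific thing.
estimate_location <- function(x) mean(x)
estimate_scale <- function(x) sd(x)
normalize <- function(x, location, scale) {
  (x - location) / scale
}

x <- c(2, 4, 6, 8)
z <- normalize(x, location = estimate_location(x), scale = estimate_scale(x))
```

With the split version, swapping in a different estimate (say, the median for location) means changing one small function rather than adding another optional argument.
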
Some other best practices with functions:
- Functions should be self-contained, not drawing on global variables
    - For readability
    - For functionality (a function that depends on globals can't easily be exported to its own package or reused in other analyses, for example)
- Write unit tests for the functions (look at the `testthat` package)
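
A minimal sketch of both practices, with hypothetical function names. The `testthat` block uses `test_that()` and `expect_equal()` and runs only if the package is installed.

```r
# Relies on a global: the result silently changes whenever `threshold` changes.
threshold <- 8
count_below_global <- function(x) sum(x < threshold)

# Self-contained: everything the function needs arrives as an argument.
count_below <- function(x, threshold) {
  sum(x < threshold)
}

# Unit tests with testthat, run only if the package is available.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("count_below counts values strictly below the threshold", {
    testthat::expect_equal(count_below(c(4, 6, 8), threshold = 8), 2)
    testthat::expect_equal(count_below(numeric(0), threshold = 8), 0)
  })
}
```

The self-contained version can be copied into a package or another analysis as-is, while the global version would quietly break if `threshold` were missing or renamed.
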
### Coding Practices: Style Guides
When writing code, at the very least, it's important to be consistent with your style. We highly recommend:
- The [PEP 8 style guide](https://www.python.org/dev/peps/pep-0008/?) for Python.
- The [tidyverse style guide](http://style.tidyverse.org/) for R.
    - Sections 4+ are not so relevant yet.
<!--
__Activity__: In five minutes:
1. On your own, read over some style rules: a few small ones, or one bigger one.
2. Form groups of 2-4.
3. With your group members, share the following:
- what you read.
- your opinion about what you read.
- if applicable, style that you especially like to follow.
4. Address some in class.
-->
<!--
Scenario: how could the following practice be made better?
> You're writing code directly to the console, and end up with a plot from your machine learning model, so you click save and use the plot in a report.
-->