# Why Aesthetic Choices are Important {#sec-aesthetics}
Take a moment to appreciate this comic from [xkcd.com](https://xkcd.com/2864/):
```{r setup}
#| echo: false
#| include: false
source("_common.R")
```
[![](https://imgs.xkcd.com/comics/compact_graphs.png){fig-alt="Comic from xkcd.com"}](https://xkcd.com/2864/)
For most people, aesthetics is the art of making things pleasant to look at. To the data visualizer though, "aesthetics" means something much more precise: Aesthetics are the visual representation of variables.
Just as journalists need to decide which words to use to express their ideas, data visualizers need to decide which aesthetics to use to express their variables.
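In ggplot2, this decision is written down inside `aes()`. The following is a minimal sketch, assuming a made-up data frame called `toy` (not part of this chapter's data), that maps the same variable to two different aesthetics and then inspects the resulting mappings.

```{r}
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(num = c(1, 2, 3))

# The same variable, represented by point size...
p_size <- ggplot(toy, aes(x = num, y = 0, size = num)) +
  geom_point()

# ...and by transparency (alpha)
p_alpha <- ggplot(toy, aes(x = num, y = 0, alpha = num)) +
  geom_point()

# List the aesthetics that each plot maps to a variable
names(p_size$mapping)
names(p_alpha$mapping)
```

The data are identical in both plots; only the aesthetic chosen to carry `num` changes.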
There are many options. Below are six different ways to represent the numbers "1", "2", and "3", each mapping them to a different aesthetic.
```{r}
#| echo: false
# Some simple data
tibble(num = c(1, 2, 3)) |>
  ggplot() +
  # position
  geom_point(aes(x = factor(num), y = "position"),
             size = 4) +
  geom_hline(yintercept = "position", linewidth = .2) +
  geom_linerange(aes(x = num, ymin = 2.8), ymax = 3.2, linewidth = .2) +
  # shape
  geom_point(aes(x = num, y = "shape", shape = factor(num)),
             size = 4) +
  # color
  geom_point(aes(x = num, y = "color", color = num),
             size = 4) +
  # size
  geom_point(aes(x = num, y = "size", size = num)) +
  # label
  geom_label(aes(x = num, y = "label", label = as.character(num)),
             size = 4, label.r = unit(0, "lines")) +
  # alpha
  geom_point(aes(x = num, y = "transparency (alpha)", alpha = num),
             size = 4) +
  guides(shape = "none",
         color = "none",
         size = "none",
         alpha = "none") +
  theme(panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        axis.text.y = element_text(size = 15, hjust = 1),
        axis.text.x = element_text(size = 15)) +
  scale_x_discrete(position = "top")
```
Let's take a concrete example. @eichstaedt_etal_2015 collected Twitter posts from 935 U.S. counties and counted the number of words related to positive emotions. This emotional measure could then be connected with known demographic measures of each county, such as race and average income.
```{r}
#| include: false
twitter_counties <- read_csv("data/eichstaedt_etal_2015_county.freqs.dictionaries_dense.csv") |>
  rename(fips = group_id) |>
  left_join(read_csv("data/eichstaedt_etal_2015_countyoutcomes.csv")) |>
  mutate(
    income = `incomeHC01_VC85ACS3yr$10`,
    maj = case_when(
      `blackPOP255210D$10` >= 50 ~ "Majority Black",
      `hispanicPOP405210D$10` >= 50 ~ "Majority Hispanic",
      .default = "Majority Non-Black/Hispanic"
    )
  ) |>
  select(county, state, income, posEmotions, maj)
```
```{r}
head(twitter_counties)
```
There are many ways to present this information graphically. Each choice emphasizes a different aspect of the data, and tells a different story.
In the following visualization,
- `income` is mapped to the "x" position aesthetic
- `posEmotions` (positive emotions) is mapped to the "y" position aesthetic
- `maj` (racial majority) is mapped to the color aesthetic
```{r}
#| warning: false
library(ggborderline) # for making the lines pop
twitter_counties |>
  ggplot(
    aes(income, posEmotions,
        # rearrange the categorical variable so that the order
        # in the legend matches the order in the plot
        color = factor(
          maj,
          levels = c("Majority Non-Black/Hispanic",
                     "Majority Hispanic",
                     "Majority Black")
        )
    )
  ) +
  # scatterplot
  geom_point(alpha = .5,
             # draw sample such that "Majority Non-Black/Hispanic"
             # points don't overpower the others
             data = twitter_counties |>
               group_by(maj) |>
               slice_sample(n = 100)) +
  # loess regression
  stat_smooth(
    # borders that match the lines, but are slightly darker
    aes(bordercolor = after_scale(colorspace::darken(color))),
    se = FALSE, geom = "borderline",
    linewidth = 1, lineend = "square"
  ) +
  theme_bw() +
  # nicer color palette
  scale_color_brewer(
    palette = "Paired",
    direction = -1
  ) +
  # proper formatting for income
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(
    x = "County Average Income",
    y = "Positive Emotional Words in Twitter Posts\n(proportion of total words)",
    color = ""
  )
```
This is the most intuitive way to organize the three variables. By mapping income to the x axis, we lightly suggest that it is the cause of whatever is happening on the y axis, in this case positive emotion. The idea that higher income causes positive emotion is intuitive: many people believe they would be happier with a higher income. People accustomed to languages that are written left-to-right, like English, will tend to think about what happens as they move left to right on the graph. Three LOESS regression lines encourage the viewer to compare the slopes of the three color groups, which they will go through from top to bottom:
1. In counties without a Black or Hispanic majority, greater income means more positive emotion, up to about \$60,000 a year, when the line starts flattening out.
2. In counties with a majority Hispanic population, greater income means dramatically more positive emotion, on average.
3. In counties with a majority Black population, greater income doesn't seem to make much of a difference.
But just because this scheme is the most intuitive does not mean it is the best one.
The next visualization shows the same data but tells a different story. This one also has x, y, and color, but they are mapped to the variables differently:
- `income` is binned and mapped to the "x" position aesthetic
- `maj` (racial majority) is mapped to the "y" position aesthetic
- `posEmotions` (positive emotions) is mapped to the "fill" color aesthetic
```{r}
#| warning: false
twitter_counties |>
  # hand-made bins (note the inconsistent bin width to
  # give more space to the center of the distribution)
  mutate(
    income = factor(
      case_when(
        income > 100000 ~ "$100,000+",
        income > 80000 ~ "$80,000-$100,000",
        income > 60000 ~ "$60,000-$80,000",
        income > 50000 ~ "$50,000-$60,000",
        income > 40000 ~ "$40,000-$50,000",
        .default = "$20,000-$40,000"),
      levels = c("$20,000-$40,000",
                 "$40,000-$50,000",
                 "$50,000-$60,000",
                 "$60,000-$80,000",
                 "$80,000-$100,000",
                 "$100,000+")
    )
  ) |>
  # aggregate by the new bins
  group_by(income, maj) |>
  summarise(
    posEmotions = mean(posEmotions,
                       na.rm = TRUE)
  ) |>
  # plot
  ggplot(aes(income, maj, fill = posEmotions)) +
  # tiles with a little bit of space in between
  geom_tile(width = .95, height = .95) +
  # minimal theme
  theme_minimal() +
  # color scale to emphasize differences between extremes
  scale_fill_gradient2(low = "blue",
                       mid = "green",
                       high = "yellow",
                       midpoint = .0055) +
  labs(
    x = "County Average Income",
    y = "",
    fill = "Positive Emotional Words\nin Twitter Posts\n(proportion of total words)"
  ) +
  theme(
    # angles x axis text to fit it all in
    axis.text.x = element_text(angle = 30, hjust = 1),
    axis.title = element_text(size = 8),
    legend.title = element_text(size = 8)
  ) +
  # constrain the tiles to be perfectly square
  coord_equal()
```
Using fill makes it harder to see the slope within each racial group, and easier to see the differences between them. The vertical ordering from most positive emotion (in majority non-Black/Hispanic counties) to least positive emotion (majority Black counties) emphasizes this even more. The blank squares on the grid also make the point that there are no majority Black or Hispanic counties with average incomes above \$80,000.
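The hand-made income bins used for this plot could also be produced with base R's `cut()`. The following is only a sketch of that alternative, using a made-up `income` vector rather than the county data. One assumption to note: `cut()` returns `NA` for a missing income, whereas the `.default` branch of `case_when()` would place it in the lowest bin.

```{r}
# Hypothetical incomes, invented for illustration only
income <- c(25000, 45000, 55000, 70000, 90000, 120000)

income_bin <- cut(
  income,
  # -Inf as the lowest break mimics the catch-all lowest bin
  breaks = c(-Inf, 40000, 50000, 60000, 80000, 100000, Inf),
  labels = c("$20,000-$40,000",
             "$40,000-$50,000",
             "$50,000-$60,000",
             "$60,000-$80,000",
             "$80,000-$100,000",
             "$100,000+"),
  # intervals are (lower, upper], matching conditions like `income > 40000`
  right = TRUE
)

# One hypothetical income falls into each bin
table(income_bin)
```

Because `cut()` returns a factor whose levels follow the order of `labels`, no separate `factor(..., levels = ...)` step is needed to keep the bins in order on the x axis.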
Let's try one more way to present these data. In this visualization,
- `income` is mapped to the "y" position aesthetic
- `posEmotions` (positive emotions) is mapped to the "fill" color aesthetic
- `maj` (racial majority) is mapped to the "x" position aesthetic
```{r}
#| warning: false
set.seed(2023)
twitter_counties |>
  ggplot(aes(maj, income, fill = posEmotions)) +
  # sina plot
  ggbeeswarm::geom_quasirandom(
    aes(color = after_scale(colorspace::darken(fill, .3))),
    alpha = .5, method = "pseudorandom",
    shape = 21, varwidth = TRUE
  ) +
  # color scheme which maximizes the visibility of different
  # values among the crowd (this is a losing battle)
  scale_fill_gradient2(low = "red",
                       mid = "white",
                       high = "blue",
                       midpoint = .005) +
  # unintrusive theme
  theme_bw() +
  # log scale, and proper formatting for income
  scale_y_continuous(
    labels = scales::dollar_format(),
    trans = "log10"
  ) +
  labs(
    x = " ",
    y = "County Average Income",
    fill = "Positive Emotional Words\nin Twitter Posts\n(proportion of total words)"
  ) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
```
This is a *sina plot* (closely related to the *beeswarm plot*), in which point clouds are arranged by a continuous variable on one axis and a categorical variable on the other axis, and are spread out sideways in proportion to their density, filling the space between the categories.
Sina plots are a good way to compare distributions of different groups (they are almost always more informative than box plots or violin plots); this plot emphasizes that counties with a majority Black population tend to have relatively low average incomes. The other story that this plot tells is the uneven sizes of the three groups: by separating out the points in each group, this visualization emphasizes the fact that there are very few U.S. counties with a majority Black or Hispanic population.
This plot makes it very difficult to learn anything about positive emotion in Twitter posts. The colors themselves give some guidance: the white in the middle of the scale suggests that it represents some sort of zero point---a "normal" amount of positive emotion. Nevertheless, the viewer will have to squint in order to notice the trend for higher income counties to be happier. The emotional variable is not adding much to this visualization.
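To see why the middle color reads as a reference point, it can help to look at how a diverging scale places its colors. The following is a rough sketch of the idea using the scales package, not a reproduction of ggplot2's internals. The `pos` values are made up for illustration; only the midpoint of .005 and the red-white-blue colors come from the plot above.

```{r}
library(scales)

# Hypothetical proportions of positive-emotion words (invented values)
pos <- c(0.003, 0.005, 0.007)

# Rescale to [0, 1] so that the chosen midpoint lands exactly on 0.5
rescaled <- rescale_mid(pos, to = c(0, 1), from = range(pos), mid = 0.005)
rescaled

# A diverging palette: 0 is the low color, 0.5 the mid color, 1 the high color
pal <- div_gradient_pal(low = "red", mid = "white", high = "blue")
pal(rescaled)
```

Any value equal to the midpoint is drawn in the middle color, so the choice of `midpoint` decides which amount of positive emotion the viewer will read as "normal".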
In this chapter we have seen how choices about which aesthetics to map to variables make a big difference in the way a data visualization is interpreted. Three visualizations of the same data can tell very different stories.
------------------------------------------------------------------------
Press the "View Source" button below to see the hidden code blocks in this chapter.
------------------------------------------------------------------------