forked from UBC-DSCI/introduction-to-datascience
inference.Rmd
# Statistical inference {#inference}
```{r inference-setup, include = FALSE}
knitr::opts_chunk$set(warning = FALSE, fig.align = "center")
library(gridExtra)
library(tidyverse)
library(grid) # grid.bezier()
options(digits = 3) # set number of digits to display in output to match writing
max_count <- function(dist) {
max(ggplot_build(dist)$data[[1]]$count)
}
max_x <- function(dist) {
max(ggplot_build(dist)$data[[1]]$xmax)
}
min_x <- function(dist) {
ggp_data <- ggplot_build(dist)
min(ggp_data$data[[1]]$xmin)
}
theme_update(axis.title = element_text(size = 12)) # modify axis label size in plots
```
## Overview
A typical data analysis task in practice is to draw conclusions about some
unknown aspect of a population of interest based on observed data sampled from
that population; we typically do not get data on the *entire* population. Data
analysis questions regarding how summaries, patterns, trends, or relationships
in a data set extend to the wider population are called *inferential
questions*. This chapter will start with the fundamental ideas of sampling from
populations and then introduce two common techniques in statistical inference:
*point estimation* and *interval estimation*.
## Chapter learning objectives
By the end of the chapter, readers will be able to do the following:
* Describe real-world examples of questions that can be answered with statistical inference.
* Define common population parameters (e.g., mean, proportion, standard deviation) that are often estimated using sampled data, and estimate these from a sample.
* Define the following statistical sampling terms: population, sample, population parameter, point estimate, and sampling distribution.
* Explain the difference between a population parameter and a sample point estimate.
* Use R to draw random samples from a finite population.
* Use R to create a sampling distribution from a finite population.
* Describe how sample size influences the sampling distribution.
* Define bootstrapping.
* Use R to create a bootstrap distribution to approximate a sampling distribution.
* Contrast the bootstrap and sampling distributions.
## Why do we need sampling?
We often need to understand how quantities we observe in a subset
of data relate to the same quantities in the broader population. For example, suppose a
retailer is considering selling iPhone accessories, and they want to estimate
how big the market might be. Additionally, they want to strategize how they can
market their products on North American college and university campuses. This
retailer might formulate the following question:
*What proportion of all undergraduate students in North America own an iPhone?*
In the above question, we are interested in making a conclusion about *all*
undergraduate students in North America; this is referred to as the **population**. \index{population} In
general, the population is the complete collection of individuals or cases we
are interested in studying. Further, in the above question, we are interested
in computing a quantity—the proportion of iPhone owners—based on
the entire population. This proportion is referred to as a **population parameter**. \index{population!parameter} In
general, a population parameter is a numerical characteristic of the entire
population. To compute this number in the example above, we would need to ask
every single undergraduate in North America whether they own an iPhone. In
practice, directly computing population parameters is often time-consuming and
costly, and sometimes impossible.
A more practical approach would be to make measurements for a **sample**, i.e., a \index{sample}
subset of individuals collected from the population. We can then compute a
**sample estimate**—a numerical characteristic of the sample—that \index{sample!estimate}
estimates the population parameter. For example, suppose we randomly selected
ten undergraduate students across North America (the sample) and computed the
proportion of those students who own an iPhone (the sample estimate). In that
case, we might suspect that proportion is a reasonable estimate of the
proportion of students who own an iPhone in the entire population. Figure
\@ref(fig:11-population-vs-sample) illustrates this process.
In general, the process of using a sample to make a conclusion about the
broader population from which it is taken is referred to as **statistical inference**.
\index{inference}\index{statistical inference|see{inference}}
```{r 11-population-vs-sample, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Population versus sample.", out.width="100%"}
knitr::include_graphics("img/population_vs_sample.png")
```
Note that proportions are not the *only* kind of population parameter we might
be interested in. For example, suppose an undergraduate student studying at the University
of British Columbia in Canada is looking for an apartment
to rent. They need to create a budget, so they want to know about
studio apartment rental prices in Vancouver. This student might
formulate the question:
*What is the average price per month of studio apartment rentals in Vancouver?*
In this case, the population consists of all studio apartment rentals in Vancouver, and the
population parameter is the *average price per month*. Here we used the average
as a measure of the center to describe the "typical value" of studio apartment
rental prices. But even within this one example, we could also be interested in
many other population parameters. For instance, we know that not every studio
apartment rental in Vancouver will have the same price per month. The student
might be interested in how much monthly prices vary and want to find a measure
of the rentals' spread (or variability), such as the standard deviation. Or perhaps the
student might be interested in the fraction of studio apartment rentals that
cost more than \$1000 per month. The question we want to answer will help us
determine the parameter we want to estimate. If we were somehow able to observe
the whole population of studio apartment rental offerings in Vancouver, we
could compute each of these numbers exactly; therefore, these are all
population parameters. There are many kinds of observations and population
parameters that you will run into in practice, but in this chapter, we will
focus on two settings:
1. Using categorical observations to estimate the proportion of a category
2. Using quantitative observations to estimate the average (or mean)
## Sampling distributions
### Sampling distributions for proportions
We will look at an example using data from
[Inside Airbnb](http://insideairbnb.com/) [@insideairbnb]. Airbnb \index{Airbnb} is an online
marketplace for arranging vacation rentals and places to stay. The data set
contains listings for Vancouver, Canada, in September 2020. Our data
includes an ID number, neighborhood, type of room, the number of people the
rental accommodates, number of bathrooms, bedrooms, beds, and the price per
night.
<!--
airbnb <- read_csv("data/listings.csv") |>
select(id, neighborhood = neighborhood_cleansed, room_type, accommodates, bathrooms = bathrooms_text, bedrooms, beds, price) |>
mutate(price = as.numeric(str_remove(price, "[$]"))) |>
na.omit()
airbnb <- airbnb |>
mutate(id = 1:nrow(airbnb))
-->
\index{seed!set.seed}
```{r load-airbnb, message = FALSE, warning = FALSE}
library(tidyverse)
set.seed(123)
airbnb <- read_csv("data/listings.csv")
airbnb
```
Suppose the city of Vancouver wants information about Airbnb rentals to help
plan city bylaws, and they want to know how many Airbnb places are listed as
entire homes and apartments (rather than as private or shared rooms). Therefore
they may want to estimate the true proportion of all Airbnb listings where the
"type of place" is listed as "entire home or apartment." Of course, we usually
do not have access to the true population, but here let's imagine (for learning
purposes) that our data set represents the population of all Airbnb rental
listings in Vancouver, Canada. We can find the proportion of listings where
`room_type == "Entire home/apt"`.
\index{pull}\index{sum}\index{nrow}
```{r 11-example-proportions2, echo = TRUE, message = FALSE, warning = FALSE }
airbnb |>
summarize(
n = sum(room_type == "Entire home/apt"),
proportion = sum(room_type == "Entire home/apt") / nrow(airbnb)
)
```
```{r 11-population-parameter, echo = F}
population_proportion <- airbnb |>
summarize(
n = sum(room_type == "Entire home/apt"),
proportion = sum(room_type == "Entire home/apt") / nrow(airbnb)
) |>
pull()
```
We can see that the proportion of `Entire home/apt` listings in
the data set is `r round(population_proportion,3)`. This
value, `r round(population_proportion,3)`, is the population parameter. Remember, this
parameter value is usually unknown in real data analysis problems, as it is
typically not possible to make measurements for an entire population.
Instead, perhaps we can approximate it with a small subset of data!
To investigate this idea, let's try randomly selecting 40 listings (*i.e.,* taking a random sample of
size 40 from our population), and computing the proportion for that sample.
We will use the `rep_sample_n` function \index{rep\_sample\_n} from the `infer`
package \index{infer} to take the sample. The arguments of `rep_sample_n` are (1) the data frame to
sample from, and (2) the size of the sample to take.
```{r 11-example-proportions3-seed, echo = FALSE, message = FALSE, warning = FALSE}
# hidden seed
set.seed(123)
```
```{r 11-example-proportions3, echo = TRUE, message = FALSE, warning = FALSE}
library(infer)
sample_1 <- rep_sample_n(tbl = airbnb, size = 40)
airbnb_sample_1 <- summarize(sample_1,
n = sum(room_type == "Entire home/apt"),
prop = sum(room_type == "Entire home/apt") / 40
)
airbnb_sample_1
```
Here we see that the proportion of entire home/apartment listings in this
random sample is `r round(airbnb_sample_1$prop,2)`. Wow—that's close to our
true population value! But remember, we computed the proportion using a random sample of size 40.
This has two consequences. First, this value is only an *estimate*, i.e., our best guess
of our population parameter using this sample.
Given that we are estimating a single value here, we often
refer to it as a **point estimate**. Second, since the sample was random,
if we were to take *another* random sample of size 40 and compute the proportion for that sample,
we would not get the same answer:
```{r 11-example-proportions4-seed, echo = FALSE, message = FALSE, warning = FALSE}
# hidden seed
set.seed(1234)
```
```{r 11-example-proportions4, echo = TRUE, message = FALSE, warning = FALSE}
sample_2 <- rep_sample_n(airbnb, size = 40)
airbnb_sample_2 <- summarize(sample_2,
n = sum(room_type == "Entire home/apt"),
prop = sum(room_type == "Entire home/apt") / 40
)
airbnb_sample_2
```
Confirmed! We get a different value for our estimate this time.
That means that our point estimate might be unreliable. Indeed, estimates vary from sample to
sample due to **sampling variability**. But just how much
should we expect the estimates of our random samples to vary?
Or in other words, how much can we really trust our point estimate based on a single sample?
To understand this, we will simulate many samples (much more than just two)
of size 40 from our population of listings and calculate the proportion of
entire home/apartment listings in each sample. This simulation will create
many sample proportions, which we can visualize using a histogram. The
distribution of the estimate for all possible samples of a given size (which we
commonly refer to as $n$) from a population is called
a **sampling distribution**. \index{sampling distribution} The sampling distribution will help us see how much we would
expect our sample proportions from this population to vary for samples of size 40.
We again use `rep_sample_n` to take samples of size 40 from our
population of Airbnb listings. But this time we set the `reps` argument to 20,000 to specify
that we want to take 20,000 samples of size 40. \index{rep\_sample\_n!reps argument}
\index{rep\_sample\_n!size argument}
```{r 11-example-proportions5, echo = TRUE, message = FALSE, warning = FALSE}
samples <- rep_sample_n(airbnb, size = 40, reps = 20000)
samples
```
Notice that the column `replicate` indicates the replicate, or sample, to which
each listing belongs. Above, since by default R only prints the first few rows,
it looks like all of the listings have `replicate` set to 1. But you can
check the last few entries using the `tail()` function to verify that
we indeed created 20,000 samples (or replicates).
```{r 11-example-proportions5b, echo = TRUE, message = FALSE, warning = FALSE}
tail(samples)
```
Now that we have obtained the samples, we need to compute the
proportion of entire home/apartment listings in each sample.
We first group the data by the `replicate` variable—to group the
set of listings in each sample together—and then use `summarize`
to compute the proportion in each sample.
We print both the first and last few entries of the resulting data frame
below to show that we end up with 20,000 point estimates, one for each of the 20,000 samples.
```{r 11-example-proportions6, echo = TRUE, message = FALSE, warning = FALSE}
sample_estimates <- samples |>
group_by(replicate) |>
summarize(sample_proportion = sum(room_type == "Entire home/apt") / 40)
sample_estimates
tail(sample_estimates)
```
We can now visualize the sampling distribution of sample proportions
for samples of size 40 using a histogram in Figure \@ref(fig:11-example-proportions7). Keep in mind: in the real world,
we don't have access to the full population. So we
can't take many samples and can't actually construct or visualize the sampling distribution.
We have created this particular example
such that we *do* have access to the full population, which lets us visualize the
sampling distribution directly for learning purposes.
```{r 11-example-proportions7, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Sampling distribution of the sample proportion for sample size 40.", fig.height = 3.3, fig.width = 4.2}
sampling_distribution <- ggplot(sample_estimates, aes(x = sample_proportion)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey", bins = 12) +
labs(x = "Sample proportions", y = "Count") +
theme(text = element_text(size = 12))
sampling_distribution
```
The sampling distribution in Figure \@ref(fig:11-example-proportions7) appears
to be bell-shaped, is roughly symmetric, and has one peak. It is centered
around `r round(mean(sample_estimates$sample_proportion),1)` and the sample proportions
range from about `r round(min(sample_estimates$sample_proportion), 1)` to about
`r round(max(sample_estimates$sample_proportion), 1)`. In fact, we can
calculate the mean of the sample proportions. \index{sampling distribution!shape}
```{r 11-example-proportions8, echo = TRUE, message = FALSE, warning = FALSE}
sample_estimates |>
summarize(mean = mean(sample_proportion))
```
We notice that the sample proportions are centered around the population
proportion value, `r round(population_proportion,3)`! In general, the mean of
the sampling distribution should be equal to the population proportion.
This is great news because it means that the sample proportion is neither an overestimate nor an
underestimate of the population proportion.
In other words, if you were to take many samples as we did above, there is no tendency
towards over or underestimating the population proportion.
In a real data analysis setting where you just have access to your single
sample, this implies that you would suspect that your sample point estimate is
roughly equally likely to be above or below the true population proportion.
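This centering property can be checked with a quick simulation. Below is a minimal,
self-contained sketch: it uses a small *made-up* population of 0/1 indicators (the 75%
proportion and population size are hypothetical, not the Airbnb values), and base R's
`sample()` stands in for `rep_sample_n` so the example runs without any packages or data files.

```r
set.seed(42)

# Hypothetical population: 4500 listings coded 1 ("Entire home/apt") or 0.
# The 75% figure is made up for illustration; it is not the Airbnb value.
population <- rep(c(1, 0), times = c(3375, 1125))
true_prop <- mean(population)  # population parameter: 0.75

# 10,000 random samples of size 40, each summarized by its sample proportion
sample_props <- replicate(10000, mean(sample(population, size = 40)))

# The sampling distribution is centered (approximately) at the parameter
mean(sample_props)  # close to 0.75
```

Because each sample proportion has no tendency to over- or underestimate the parameter,
averaging many of them recovers the population proportion up to simulation noise.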
### Sampling distributions for means
In the previous section, our variable of interest—`room_type`—was
*categorical*, and the population parameter was a proportion. As mentioned in
the chapter introduction, there are many choices of the population parameter
for each type of variable. What if we wanted to infer something about a
population of *quantitative* variables instead? For instance, a traveler
visiting Vancouver, Canada may wish to estimate the
population *mean* (or average) price per night of Airbnb listings. Knowing
the average could help them tell whether a particular listing is overpriced.
We can visualize the population distribution of the price per night with a histogram.
```{r, echo = FALSE}
options(pillar.sigfig = 5)
```
```{r 11-example-means2, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Population distribution of price per night (Canadian dollars) for all Airbnb listings in Vancouver, Canada.", fig.height = 3.5, fig.width = 4.5}
population_distribution <- ggplot(airbnb, aes(x = price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
labs(x = "Price per night (Canadian dollars)", y = "Count") +
theme(text = element_text(size = 12))
population_distribution
```
In Figure \@ref(fig:11-example-means2), we see that the population distribution \index{population!distribution}
has one peak. It is also skewed (i.e., is not symmetric): most of the listings are
less than \$250 per night, but a small number of listings cost much more,
creating a long tail on the histogram's right side.
Along with visualizing the population, we can calculate the population mean,
the average price per night for all the Airbnb listings.
```{r 11-example-means-popmean, echo = TRUE, message = FALSE, warning = FALSE}
population_parameters <- airbnb |>
summarize(pop_mean = mean(price))
population_parameters
```
The price per night of all Airbnb rentals in Vancouver, BC
is \$`r round(population_parameters$pop_mean,2)`, on average. This value is our
population parameter since we are calculating it using the population data. \index{population!parameter}
Now suppose we did not have access to the population data (which is usually the
case!), yet we wanted to estimate the mean price per night. We could answer
this question by taking a random sample of as many Airbnb listings as our time
and resources allow. Let's say we could do this for 40 listings. What would
such a sample look like? Let's take advantage of the fact that we do have
access to the population data and simulate taking one random sample of 40
listings in R, again using `rep_sample_n`.
\index{rep\_sample\_n}
```{r 11-example-means3, echo = TRUE, message = FALSE, warning = FALSE}
one_sample <- airbnb |>
rep_sample_n(40)
```
We can create a histogram to visualize the distribution of observations in the
sample (Figure \@ref(fig:11-example-means-sample-hist)), and calculate the mean
of our sample.
```{r 11-example-means-sample-hist, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Distribution of price per night (Canadian dollars) for sample of 40 Airbnb listings.", fig.height = 3.5, fig.width = 4.5}
sample_distribution <- ggplot(one_sample, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
labs(x = "Price per night (Canadian dollars)", y = "Count") +
theme(text = element_text(size = 12))
sample_distribution
estimates <- one_sample |>
summarize(sample_mean = mean(price))
estimates
```
The average value of the sample of size 40
is \$`r round(estimates$sample_mean, 2)`. This
number is a point estimate for the mean of the full population.
Recall that the population mean was
\$`r round(population_parameters$pop_mean,2)`. So our estimate was fairly close to
the population parameter: the mean was about
`r round(100*abs(estimates$sample_mean - population_parameters$pop_mean)/population_parameters$pop_mean, 1)`%
off. Note that we usually cannot compute the estimate's accuracy in practice
since we do not have access to the population parameter; if we did, we wouldn't
need to estimate it!
Also, recall from the previous section that the point estimate can vary; if we
took another random sample from the population, our estimate's value might
change. So then, did we just get lucky with our point estimate above? How much
does our estimate vary across different samples of size 40 in this example?
Again, since we have access to the population, we can take many samples and
plot the sampling distribution of \index{sampling distribution} sample means for samples of size 40 to
get a sense for this variation. In this case, we'll use 20,000 samples of size
40.
```{r}
samples <- rep_sample_n(airbnb, size = 40, reps = 20000)
samples
```
Now we can calculate the sample mean for each replicate and plot the sampling
distribution of sample means for samples of size 40.
```{r 11-example-means4, echo = TRUE, message = FALSE, fig.pos = "H", out.extra="", warning = FALSE, fig.cap= "Sampling distribution of the sample means for sample size of 40.", fig.height = 3.5, fig.width = 4.5}
sample_estimates <- samples |>
group_by(replicate) |>
summarize(sample_mean = mean(price))
sample_estimates
sampling_distribution_40 <- ggplot(sample_estimates, aes(x = sample_mean)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
labs(x = "Sample mean price per night (Canadian dollars)", y = "Count") +
theme(text = element_text(size = 12))
sampling_distribution_40
```
In Figure \@ref(fig:11-example-means4), the sampling distribution of the mean
has one peak and is \index{sampling distribution!shape} bell-shaped. Most of the estimates are between
about \$`r round(quantile(sample_estimates$sample_mean)[2], -1)` and
\$`r round(quantile(sample_estimates$sample_mean)[4], -1)`; but
a good fraction of cases fall outside this range (i.e., where the point estimate was
not close to the population parameter). So it does indeed look like we were
quite lucky when we estimated the population mean with only
`r round(100*abs(estimates$sample_mean - population_parameters$pop_mean)/population_parameters$pop_mean, 1)`% error.
Let's visualize the population distribution, distribution of the sample, and
the sampling distribution on one plot to compare them in Figure
\@ref(fig:11-example-means5). Comparing these three distributions, the centers
of the distributions are all around the same price (around \$150). The original
population distribution has a long right tail, and the sample distribution has
a similar shape to that of the population distribution. However, the sampling
distribution is not shaped like the population or sample distribution. Instead,
it has a bell shape, and it has a lower spread than the population or sample
distributions. The sample means vary less than the individual observations
because there will be some high values and some low values in any random
sample, which will keep the average from being too extreme.
\index{sampling distribution!compared to population distribution}
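We can also quantify *how much* less the sample means vary: for samples of size $n$, the
standard deviation of the sampling distribution of the mean is approximately the population
standard deviation divided by $\sqrt{n}$. The sketch below checks this using a hypothetical
right-skewed population generated by `rexp()` as a stand-in for the price data, so that the
example is self-contained and needs no packages or data files.

```r
set.seed(7)

# Hypothetical right-skewed "price" population (a stand-in, not the real data):
# exponential with mean $150, giving a long right tail like the one above
population <- rexp(5000, rate = 1 / 150)

# Sampling distribution of the mean for samples of size 40
sample_means <- replicate(10000, mean(sample(population, size = 40)))

sd(population)             # spread of individual prices
sd(sample_means)           # much smaller spread of the sample means
sd(population) / sqrt(40)  # theory: roughly the sd of the sample means
```

The last two numbers agree closely, which is exactly the averaging effect described above:
pooling 40 observations into one mean shrinks the spread by a factor of about $\sqrt{40}$.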
<!---
```{r 11-example-means4.5}
sample_estimates |>
summarize(mean_of_sample_means = mean(sample_mean))
```
Notice that the mean of the sample means is \$`r round(mean(sample_estimates$sample_mean),2)`. Recall that the population mean
was \$`r round(mean(airbnb$price),2)`.
-->
```{r 11-example-means5, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 5.5, fig.width = 4, fig.cap = "Comparison of population distribution, sample distribution, and sampling distribution."}
grid.arrange(population_distribution +
ggtitle("Population") +
xlim(min(airbnb$price), 600),
sample_distribution +
ggtitle("Sample (n = 40)") +
xlim(min(airbnb$price), 600),
sampling_distribution_40 +
ggtitle("Sampling distribution of the mean \n for samples of size 40") +
xlim(min(airbnb$price), 600),
nrow = 3
)
```
Given that there is quite a bit of variation in the sampling distribution of
the sample mean—i.e., the point estimate that we obtain is not very
reliable—is there any way to improve the estimate? One way to improve a
point estimate is to take a *larger* sample. To illustrate what effect this
has, we will take many samples of size 20, 50, 100, and 500, and plot the
sampling distribution of the sample mean. We indicate the mean of the sampling
distribution with a red vertical line.
```{r 11-example-means6, echo = FALSE, message = FALSE, warning = FALSE}
## Sampling n = 20, 50, 100, 500
sample_estimates_20 <- rep_sample_n(airbnb, size = 20, reps = 20000) |>
group_by(replicate) |>
summarize(sample_mean = mean(price))
sample_estimates_50 <- rep_sample_n(airbnb, size = 50, reps = 20000) |>
group_by(replicate) |>
summarize(sample_mean = mean(price))
sample_estimates_100 <- rep_sample_n(airbnb, size = 100, reps = 20000) |>
group_by(replicate) |>
summarize(sample_mean = mean(price))
sample_estimates_500 <- rep_sample_n(airbnb, size = 500, reps = 20000) |>
group_by(replicate) |>
summarize(sample_mean = mean(price))
## Sampling distribution n = 20
sampling_distribution_20 <- ggplot(sample_estimates_20, aes(x = sample_mean)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
labs(x = "Sample mean price per night\n(Canadian dollars)", y = "Count") +
ggtitle("n = 20")
## Sampling distribution n = 50
sampling_distribution_50 <- ggplot(sample_estimates_50, aes(x = sample_mean)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
ylab("Count") +
xlab("Sample mean price per night\n(Canadian dollars)") +
ggtitle("n = 50") +
xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))
## Sampling distribution n = 100
sampling_distribution_100 <- ggplot(sample_estimates_100, aes(x = sample_mean)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
ylab("Count") +
xlab("Sample mean price per night\n(Canadian dollars)") +
ggtitle("n = 100") +
xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))
## Sampling distribution n = 500
sampling_distribution_500 <- ggplot(sample_estimates_500, aes(x = sample_mean)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
ylab("Count") +
xlab("Sample mean price per night\n(Canadian dollars)") +
ggtitle("n = 500") +
xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20))
```
```{r 11-example-means7, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Comparison of sampling distributions, with mean highlighted as a vertical red line."}
annotated_sampling_dist_20 <- sampling_distribution_20 +
geom_vline(xintercept = mean(sample_estimates$sample_mean), col = "red") +
xlim(min_x(sampling_distribution_20), max_x(sampling_distribution_20)) +
ggtitle("n = 20") +
annotate("text",
x = max_x(sampling_distribution_20),
y = max_count(sampling_distribution_20),
hjust = 1,
vjust = 1,
label = paste("mean = ", round(mean(sample_estimates$sample_mean), 1))
)+ theme(text = element_text(size = 12), axis.title=element_text(size=12))
#+
# annotate("text", x = max_x(sampling_distribution_20), y = max_count(sampling_distribution_20), hjust = 1, vjust = 3,
# label = paste("sd = ", round(sd(sample_estimates$sample_mean), 1)))
annotated_sampling_dist_50 <- sampling_distribution_50 +
geom_vline(xintercept = mean(sample_estimates_50$sample_mean), col = "red") +
## x limits set the same as n = 20 graph, y is this graph
annotate("text",
x = max_x(sampling_distribution_20),
y = max_count(sampling_distribution_50),
hjust = 1,
vjust = 1,
label = paste("mean = ", round(mean(sample_estimates_50$sample_mean), 1))
)+ theme(text = element_text(size = 12), axis.title=element_text(size=12)) #+
# annotate("text", x = max_x(sampling_distribution_20), y = max_count(sampling_distribution_50), hjust = 1, vjust = 3,
# label = paste("sd = ", round(sd(sample_estimates_50$sample_mean), 1)))
annotated_sampling_dist_100 <- sampling_distribution_100 +
geom_vline(xintercept = mean(sample_estimates_100$sample_mean), col = "red") +
annotate("text",
x = max_x(sampling_distribution_20),
y = max_count(sampling_distribution_100),
hjust = 1,
vjust = 1,
label = paste("mean = ", round(mean(sample_estimates_100$sample_mean), 1))
  ) + theme(text = element_text(size = 12), axis.title = element_text(size = 12))
annotated_sampling_dist_500 <- sampling_distribution_500 +
geom_vline(xintercept = mean(sample_estimates_500$sample_mean), col = "red") +
annotate("text",
x = max_x(sampling_distribution_20),
y = max_count(sampling_distribution_500),
hjust = 1,
vjust = 1,
label = paste("mean = ", round(mean(sample_estimates_500$sample_mean), 1))
) + theme(text = element_text(size = 12), axis.title=element_text(size=12))
grid.arrange(annotated_sampling_dist_20,
annotated_sampling_dist_50,
annotated_sampling_dist_100,
annotated_sampling_dist_500,
nrow = 2, ncol = 2
)
```
Based on the visualization in Figure \@ref(fig:11-example-means7), three points
about the sample mean become clear. First, the mean of the sample mean (across
samples) is equal to the population mean. In other words, the sampling
distribution is centered at the population mean. Second, increasing the size of
the sample decreases the spread (i.e., the variability) of the sampling
distribution. Therefore, a larger sample size results in a more reliable point
estimate of the population parameter. And third, the distribution of the sample
mean is roughly bell-shaped. \index{sampling distribution!effect of sample size}
> **Note:** You might notice that in the `n = 20` case in Figure \@ref(fig:11-example-means7),
> the distribution is not *quite* bell-shaped. There is a bit of skew towards the right!
> You might also notice that in the `n = 50` case and larger, that skew seems to disappear.
> In general, the sampling distribution—for both means and proportions—only
> becomes bell-shaped *once the sample size is large enough*.
> How large is "large enough"? Unfortunately, it depends entirely on the problem at hand. But
> as a rule of thumb, often a sample size of at least 20 will suffice.
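These observations can be checked in miniature with base R alone. The sketch below uses a simulated stand-in population (an exponential distribution is an assumption chosen only because, like the price data, it is right-skewed), draws many samples of two sizes, and compares the centers and spreads of the resulting sampling distributions:

```{r 11-sampling-dist-sketch}
set.seed(123)

# Hypothetical right-skewed population standing in for the price data
toy_population <- rexp(10000, rate = 1 / 150)

# Sampling distribution of the mean for a given sample size
sampling_dist_of_mean <- function(pop, n, reps = 2000) {
  replicate(reps, mean(sample(pop, size = n)))
}

means_20 <- sampling_dist_of_mean(toy_population, 20)
means_500 <- sampling_dist_of_mean(toy_population, 500)

# Both sampling distributions are centered near the population mean...
c(mean(toy_population), mean(means_20), mean(means_500))

# ...but the spread shrinks as the sample size grows
c(sd(means_20), sd(means_500))
```

With `n = 500` the standard deviation of the sample means is markedly smaller than with `n = 20`, which is exactly the effect of sample size on spread described above.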
<!--- > **Note:** If random samples of size $n$ are taken from a population, the sample mean $\bar{x}$ will be approximately Normal with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ as long as the sample size $n$ is large enough. $\mu$ is the population mean, $\sigma$ is the population standard deviation, $\bar{x}$ is the sample mean, and $n$ is the sample size.
> If samples are selected from a finite population as we are doing in this chapter, we should apply a finite population correction. We multiply $\frac{\sigma}{\sqrt{n}}$ by $\sqrt{\frac{N - n}{N - 1}}$ where $N$ is the population size and $n$ is the sample size. If our sample size, $n$, is small relative to the population size, this finite correction factor is less important.
--->
### Summary
1. A point estimate is a single value computed using a sample from a population (e.g., a mean or proportion).
2. The sampling distribution of an estimate is the distribution of the estimate for all possible samples of a fixed size from the same population.
3. The shape of the sampling distribution is usually bell-shaped with one peak and centered at the population mean or proportion.
4. The spread of the sampling distribution is related to the sample size. As the sample size increases, the spread of the sampling distribution decreases.
## Bootstrapping
### Overview
*Why all this emphasis on sampling distributions?*
We saw in the previous section that we could compute a **point estimate** of a
population parameter using a sample of observations from the population. And
since we constructed examples where we had access to the population, we could
evaluate how accurate the estimate was, and even get a sense of how much the
estimate would vary for different samples from the population. But in real
data analysis settings, we usually have *just one sample* from our population
and do not have access to the population itself. Therefore we cannot construct
the sampling distribution as we did in the previous section. And as we saw, our
sample estimate's value can vary significantly from the population parameter.
So reporting the point estimate from a single sample alone may not be enough.
We also need to report some notion of *uncertainty* in the value of the point
estimate.
Unfortunately, we cannot construct the exact sampling distribution without
full access to the population. However, if we could somehow *approximate* what
the sampling distribution would look like for a sample, we could
use that approximation to then report how uncertain our sample
point estimate is (as we did above with the *exact* sampling
distribution). There are several methods to accomplish this; in this book, we
will use the \index{bootstrap} *bootstrap*. We will discuss **interval estimation** and
construct \index{confidence interval}\index{interval|see{confidence interval}}
**confidence intervals** using just a single sample from a population. A
confidence interval is a range of plausible values for our population parameter.
Here is the key idea. First, if you take a big enough sample, it *looks like*
the population. Notice the histograms' shapes for samples of different sizes
taken from the population in Figure \@ref(fig:11-example-bootstrapping0). We
see that the sample’s distribution looks like that of the population for a
large enough sample.
```{r 11-example-bootstrapping0, echo = FALSE, message = FALSE, warning = FALSE, fig.height = 6.8, fig.cap = "Comparison of samples of different sizes from the population."}
sample_10 <- airbnb |>
rep_sample_n(10)
sample_distribution_10 <- ggplot(sample_10, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
xlab("Price per night (Canadian dollars)") +
ylab("Count") +
ggtitle("n = 10")
sample_20 <- airbnb |>
rep_sample_n(20)
sample_distribution_20 <- ggplot(sample_20, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
xlab("Price per night (Canadian dollars)") +
ylab("Count") +
ggtitle("n = 20")
sample_50 <- airbnb |>
rep_sample_n(50)
sample_distribution_50 <- ggplot(sample_50, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
xlab("Price per night (Canadian dollars)") +
ylab("Count") +
ggtitle("n = 50")
sample_100 <- airbnb |>
rep_sample_n(100)
sample_distribution_100 <- ggplot(sample_100, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
xlab("Price per night (Canadian dollars)") +
ylab("Count") +
ggtitle("n = 100")
sample_200 <- airbnb |>
rep_sample_n(200)
sample_distribution_200 <- ggplot(sample_200, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
xlab("Price per night (Canadian dollars)") +
ylab("Count") +
ggtitle("n = 200")
grid.arrange(sample_distribution_10 + xlim(min(airbnb$price), 600),
sample_distribution_20 +
xlim(min(airbnb$price), 600),
sample_distribution_50 +
xlim(min(airbnb$price), 600),
sample_distribution_100 +
xlim(min(airbnb$price), 600),
sample_distribution_200 +
xlim(min(airbnb$price), 600),
population_distribution +
ggtitle("Population distribution") +
xlim(min(airbnb$price), 600),
ncol = 2
)
```
In the previous section, we took many samples of the same size *from our
population* to get a sense of the variability of a sample estimate. But if our
sample is big enough that it looks like our population, we can pretend that our
sample *is* the population, and take more samples (with replacement) of the
same size from it instead! This very clever technique is
called **the bootstrap**. Note that by taking many samples from our single, observed
sample, we do not obtain the true sampling distribution, but rather an
approximation that we call **the bootstrap distribution**. \index{bootstrap!distribution}
\newpage
> **Note:** We must sample *with* replacement when using the bootstrap.
> Otherwise, if we had a sample of size $n$, and obtained a sample from it of
> size $n$ *without* replacement, it would just return our original sample!
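A quick check with a made-up vector (hypothetical values, used only for illustration) confirms why: drawing all $n$ values *without* replacement merely shuffles the original sample, while drawing *with* replacement yields repeats and omissions.

```{r 11-with-replacement-check}
set.seed(1)

original_sample <- c(120, 85, 150, 99, 210, 75, 130, 160, 95, 180)
n <- length(original_sample)

# Without replacement: just a reshuffling of the same ten values
no_replace <- sample(original_sample, size = n, replace = FALSE)
identical(sort(no_replace), sort(original_sample))  # TRUE

# With replacement: some values appear multiple times, others not at all
with_replace <- sample(original_sample, size = n, replace = TRUE)
table(with_replace)
```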
This section will explore how to create a bootstrap distribution from a single
sample using R. The process is visualized in Figure \@ref(fig:11-intro-bootstrap-image).
For a sample of size $n$, you would do the following:
1. Randomly select an observation from the original sample, which was drawn from the population.
2. Record the observation's value.
3. Replace that observation.
4. Repeat steps 1–3 (sampling *with* replacement) until you have $n$ observations, which form a bootstrap sample.
5. Calculate the bootstrap point estimate (e.g., mean, median, proportion, slope, etc.) of the $n$ observations in your bootstrap sample.
6. Repeat steps 1–5 many times to create a distribution of point estimates (the bootstrap distribution).
7. Calculate the plausible range of values around our observed point estimate.
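Before turning to R's resampling helpers, the steps above can be sketched in base R; `original_sample` here is a hypothetical vector of 40 observed values, and the mean serves as the point estimate in step 5:

```{r 11-bootstrap-steps-sketch}
set.seed(2023)

# Hypothetical observed sample of n = 40 values
original_sample <- round(runif(40, min = 50, max = 300))

# Steps 1-4: draw n observations with replacement (one bootstrap sample);
# step 5: compute its point estimate (here, the mean)
one_bootstrap_mean <- function(x) {
  mean(sample(x, size = length(x), replace = TRUE))
}

# Step 6: repeat many times to build the bootstrap distribution
bootstrap_means <- replicate(1000, one_bootstrap_mean(original_sample))

# Step 7: a plausible range, e.g., the middle 95% of bootstrap estimates
quantile(bootstrap_means, probs = c(0.025, 0.975))
```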
```{r 11-intro-bootstrap-image, echo = FALSE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Overview of the bootstrap process.", fig.retina = 2, out.width="100%"}
knitr::include_graphics("img/intro-bootstrap.jpeg")
```
### Bootstrapping in R
Let’s continue working with our Airbnb example to illustrate how we might create
and use a bootstrap distribution using just a single sample from the population.
Once again, suppose we are
interested in estimating the population mean price per night of all Airbnb
listings in Vancouver, Canada, using a single sample of size 40.
Recall our point estimate was \$`r round(estimates$sample_mean, 2)`. The
histogram of prices in the sample is displayed in Figure \@ref(fig:11-bootstrapping1).
```{r, echo = F, message = F, warning = F}
one_sample <- one_sample |>
ungroup() |> select(-replicate)
```
```{r 11-bootstrapping1, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Histogram of price per night (Canadian dollars) for one sample of size 40.", fig.height = 3.5, fig.width = 4.5}
one_sample
one_sample_dist <- ggplot(one_sample, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
labs(x = "Price per night (Canadian dollars)", y = "Count") +
theme(text = element_text(size = 12))
one_sample_dist
```
The histogram for the sample is skewed, with a few observations out to the right. The
mean of the sample is \$`r round(estimates$sample_mean, 2)`.
Remember, in practice, we usually only have this one sample from the population. So
this sample and estimate are the only data we can work with.
We now perform steps 1–5 listed above to generate a single bootstrap
sample in R and calculate a point estimate from that bootstrap sample. We will
use the `rep_sample_n` function as we did when we were
creating our sampling distribution. But critically, note that we now
pass `one_sample`—our single sample of size 40—as the first argument.
And since we need to sample with replacement,
we change the argument for `replace` from its default value of `FALSE` to `TRUE`.
\index{bootstrap!in R}
\index{rep\_sample\_n!bootstrap}
```{r 11-bootstrapping3, echo = TRUE, message = FALSE, fig.pos = "H", out.extra="", warning = FALSE, fig.cap = "Histogram of one bootstrap sample.", fig.height = 3.5, fig.width = 4.5}
boot1 <- one_sample |>
rep_sample_n(size = 40, replace = TRUE, reps = 1)
boot1_dist <- ggplot(boot1, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
labs(x = "Price per night (Canadian dollars)", y = "Count") +
theme(text = element_text(size = 12))
boot1_dist
summarize(boot1, mean = mean(price))
```
Notice in Figure \@ref(fig:11-bootstrapping3) that the histogram of our bootstrap sample
has a similar shape to the original sample histogram. Though the shapes of
the distributions are similar, they are not identical. You'll also notice that
the original sample mean and the bootstrap sample mean differ. How might that
happen? Remember that we are sampling with replacement from the original
sample, so we generally do not end up with exactly the same set of sample values. We are *pretending*
that our single sample is close to the population, and we are trying to
mimic drawing another sample from the population by drawing one from our original
sample.
Let's now take 20,000 bootstrap samples from the original sample (`one_sample`)
using `rep_sample_n`, and calculate the means for
each of those replicates. Recall that this assumes that `one_sample` *looks like*
our original population; but since we do not have access to the population itself,
this is often the best we can do.
```{r 11-bootstrapping4, echo = TRUE, message = FALSE, warning = FALSE}
boot20000 <- one_sample |>
rep_sample_n(size = 40, replace = TRUE, reps = 20000)
boot20000
tail(boot20000)
```
Let's take a look at histograms of the first six replicates of our bootstrap samples.
```{r 11-bootstrapping-six-bootstrap-samples, echo = TRUE, fig.pos = "H", out.extra="", message = FALSE, warning = FALSE, fig.cap = "Histograms of first six replicates of bootstrap samples."}
six_bootstrap_samples <- boot20000 |>
filter(replicate <= 6)
ggplot(six_bootstrap_samples, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
labs(x = "Price per night (Canadian dollars)", y = "Count") +
facet_wrap(~replicate) +
theme(text = element_text(size = 12))
```
We see in Figure \@ref(fig:11-bootstrapping-six-bootstrap-samples) how the
bootstrap samples differ. We can also calculate the sample mean for each of
these six replicates.
```{r 11-bootstrapping-six-bootstrap-samples-means, echo = TRUE, message = FALSE, warning = FALSE}
six_bootstrap_samples |>
group_by(replicate) |>
summarize(mean = mean(price))
```
We can see that the bootstrap sample distributions and the sample means are
different. They are different because we are sampling *with replacement*. We
will now calculate point estimates for our 20,000 bootstrap samples and
generate a bootstrap distribution of our point estimates. The bootstrap
distribution (Figure \@ref(fig:11-bootstrapping5)) suggests how we might expect
our point estimate to behave if we took another sample.
```{r 11-bootstrapping5, echo = TRUE, message = FALSE, warning = FALSE, fig.pos = "H", out.extra="", fig.cap = "Distribution of the bootstrap sample means.", fig.height = 3.5, fig.width = 4.5}
boot20000_means <- boot20000 |>
group_by(replicate) |>
summarize(mean = mean(price))
boot20000_means
tail(boot20000_means)
boot_est_dist <- ggplot(boot20000_means, aes(x = mean)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
labs(x = "Sample mean price per night \n (Canadian dollars)", y = "Count") +
theme(text = element_text(size = 12))
boot_est_dist
```
Let's compare the bootstrap distribution—which we construct by taking many samples from our original sample of size 40—with
the true sampling distribution—which corresponds to taking many samples from the population.
```{r 11-bootstrapping6, echo = F, message = FALSE, warning = FALSE, fig.cap = "Comparison of the distribution of the bootstrap sample means and sampling distribution.", fig.height = 3.5}
samples <- rep_sample_n(airbnb, size = 40, reps = 20000)
sample_estimates <- samples |>
group_by(replicate) |>
summarize(sample_mean = mean(price))
sampling_dist <- ggplot(sample_estimates, aes(x = sample_mean)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
ylab("Count") +
xlab("Sample mean price per night \n (Canadian dollars)")
annotated_sampling_dist <- sampling_dist +
xlim(min_x(sampling_dist), max_x(sampling_dist)) +
geom_vline(xintercept = mean(sample_estimates$sample_mean), col = "red") +
annotate("text",
x = max_x(sampling_dist), y = max_count(sampling_dist),
hjust = 1,
vjust = 1,
label = paste("mean = ", round(mean(sample_estimates$sample_mean), 1)))
boot_est_dist_limits <- boot_est_dist +
xlim(min_x(sampling_dist), max_x(sampling_dist))
annotated_boot_est_dist <- boot_est_dist_limits +
geom_vline(xintercept = mean(boot20000_means$mean), col = "red") +
annotate("text",
x = max_x(sampling_dist), y = max_count(boot_est_dist_limits),
vjust = 1,
hjust = 1,
label = paste("mean = ", round(mean(boot20000_means$mean), 1)))
grid.arrange(annotated_sampling_dist + ggtitle("Sampling distribution"),
annotated_boot_est_dist + ggtitle("Bootstrap distribution"),
ncol = 2
)
```
There are two essential points that we can take away from Figure
\index{sampling distribution!compared to bootstrap distribution}
\@ref(fig:11-bootstrapping6). First, the shape and spread of the true sampling
distribution and the bootstrap distribution are similar; the bootstrap
distribution lets us get a sense of the point estimate's variability. The
second important point is that the means of these two distributions are
different. The sampling distribution is centered at
\$`r round(mean(airbnb$price),2)`, the population mean value. However, the bootstrap
distribution is centered at the original sample's mean price per night,
\$`r round(mean(boot20000_means$mean), 2)`. Because we are resampling from the
original sample repeatedly, we see that the bootstrap distribution is centered
at the original sample's mean value (unlike the sampling distribution of the
sample mean, which is centered at the population parameter value).
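This centering behavior is easy to verify in miniature. In the sketch below (a hypothetical population and a single sample drawn from it, not the Airbnb data), the mean of many bootstrap means tracks the sample mean rather than the population mean:

```{r 11-bootstrap-centering-sketch}
set.seed(7)

# Hypothetical right-skewed population with mean around 150
toy_population <- rexp(10000, rate = 1 / 150)
obs_values <- sample(toy_population, size = 40)

boot_means <- replicate(
  5000,
  mean(sample(obs_values, size = 40, replace = TRUE))
)

mean(toy_population)  # population mean
mean(obs_values)      # sample mean
mean(boot_means)      # centered at the sample mean
```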
Figure
\@ref(fig:11-bootstrapping7) summarizes the bootstrapping process.
The idea here is that we can use this distribution of bootstrap sample means to
approximate the sampling distribution of the sample means when we only have one
sample. Since the bootstrap distribution approximates the spread of the
sampling distribution well, we can use the bootstrap spread to help us develop a
plausible range for our population parameter along with our estimate!
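One rough way to turn the bootstrap spread into such a range (a sketch only; one simple convention is the point estimate plus or minus roughly twice the bootstrap standard deviation) looks like the following, where the sample values are hypothetical:

```{r 11-bootstrap-se-sketch}
set.seed(42)

# Hypothetical observed sample of size 40
obs_values <- round(runif(40, min = 60, max = 400))

boot_means <- replicate(
  20000,
  mean(sample(obs_values, size = 40, replace = TRUE))
)

point_estimate <- mean(obs_values)
boot_se <- sd(boot_means)  # bootstrap spread as a standard-error estimate

# A rough plausible range: estimate +/- 2 bootstrap standard errors
c(lower = point_estimate - 2 * boot_se,
  upper = point_estimate + 2 * boot_se)
```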
```{r 11-bootstrapping7, echo = F, message = FALSE, warning = FALSE, fig.cap = "Summary of bootstrapping process."}
pop_dist <- population_distribution + ggtitle("Population") + xlab("Price") +
theme(
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank()
)
sam_dist <- one_sample_dist + ggtitle("Sample") + xlab("Price") +
theme(
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank()
)
set.seed(2)
boot2 <- one_sample |>
rep_sample_n(size = 40, replace = TRUE, reps = 1)
set.seed(3)
boot3 <- one_sample |>
rep_sample_n(size = 40, replace = TRUE, reps = 1)
boot1_dist <- boot1_dist + ggtitle("Samples with Replacement") +
theme(
axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank()
)
boot2_dist <- ggplot(boot2, aes(price)) +
geom_histogram(fill = "dodgerblue3", color = "lightgrey") +
theme(
axis.title.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank()
)