generated from othneildrew/Best-README-Template
-
Notifications
You must be signed in to change notification settings - Fork 0
/
R-WORKSHOP.Rmd
292 lines (198 loc) · 9.41 KB
/
R-WORKSHOP.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
---
title: "Data Science Project"
output: html_document
date: "2023-01-10"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
Exploratory data analysis is a data analysis approach that involves building an intuitive understanding of the dataset at hand. That is what you're being asked to do in this project!
Requirements
Please pick a dataset you will use for your analysis. Please choose from the following (I know these work and are good):
Explore your data variables and decide what questions you want to ask and answer.
1. Upload your data set to Kaggle, Radiant or R studio OR VIEW data sets.
2. Ask 2 questions that can be answered using a correlation analyses. Make a copy of this document and fill out information for each correlation.
Correlation analysis Template
https://docs.google.com/document/d/1h2zhx6STiFp7SpXzy_Q0cHm6rQIntbxO2vEel9_juaY/edit?usp=sharing
3. Ask questions that can be answered using a T-Test or ANOVA. Find the p value for your questions.
Complete 3 ANOVA Analyses Choose your own data sets.
Explanation Template -https://docs.google.com/document/d/1WVw8JKT13BRnZZlCbNMTfIG1GTDQM4puHbMB9QMqQs4/edit?usp=sharing
4. Create 1 visuals per hypothesis or question.
5. Get a basic statistics summary to explore your data. There is not only one right answer for how to explore your data.
1. Choose 1-3 datasets.
2. Lets look at the column names
3. You can look at the names of the columns in the dataset with the `names()` function
4. View the entire dataset in a new window
5. Find the mean and max of a variable in your dataset
6. Create a scatterplot
7. make another pirateplot showing the relationship between 2 variables using a different plotting theme and the `"pony"` color palette:
8. Do 2 fifferent t-tests to test a 2 hypothesis ?
9. Complete a ANOVA
10. Create a linear regression model
Below we will first load some libraries and load a bunch of data sets to choose from.
```{r cars}
library(yarrr)
library("datasets")
library(tidyverse) # metapackage of all tidyverse packages
install.packages('yarrr')
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
list.files(path = "../input")
library(yarrr)
```
## Library datasets
```{r pressure, echo=FALSE}
library(help = "datasets")
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
```{r pressure, echo=FALSE}
?sleep
```
Lets look at the column names
```{r}
head(sleep)
```
## Exploring data
Next, we'll look at the help menu for the pirates dataset using the question mark `?pirates`. When you run this, you should see a small help window open up in RStudio that gives you some information about the dataset.
```{r, eval = FALSE}
?pirates
```
First, let's take a look at the first few rows of the dataset using the `head()` function. This will show you the first few rows of the data.
```{r}
# Look at the first few rows of the data
head(pirates)
```
You can look at the names of the columns in the dataset with the `names()` function
```{r}
# What are the names of the columns?
names(pirates)
```
Finally, you can also view the entire dataset in a separate window using the `View()` function:
```{r eval = FALSE}
# View the entire dataset in a new window
View(pirates)
```
## Descriptive statistics
Now let's calculate some basic statistics on the entire dataset. We'll calculate the mean age, maximum height, and number of pirates of each sex:
```{r}
# What is the mean age?
mean(pirates$age)
# What was the tallest pirate?
max(pirates$height)
# How many pirates are there of each sex?
table(pirates$sex)
```
Now, let's calculate statistics for different groups of pirates. For example, the following code will use the `aggregate()` function to calculate the mean age of pirates, separately for each sex.
```{r}
# Calculate the mean age, separately for each sex
aggregate( age ~ sex,
data = pirates,
FUN = mean)
```
## Plotting
Cool stuff, now let's make a plot! We'll plot the relationship between pirate's height and weight using the `plot()` function
```{r}
# Create scatterplot
plot(x = pirates$height, # X coordinates
y = pirates$weight) # y-coordinates
```
```{r}
```
Now let's make a fancier version of the same plot by adding some customization
```{r}
# Create scatterplot
plot(x = pirates$height, # X coordinates
y = pirates$weight, # y-coordinates
main = 'My first scatterplot of pirate data!',
xlab = 'Height (in cm)', # x-axis label
ylab = 'Weight (in kg)', # y-axis label
pch = 16, # Filled circles
col = gray(.0, .1)) # Transparent gray
```
Now let's make it even better by adding gridlines and a blue regression line to measure the strength of the relationship.
```{r}
# Create scatterplot
plot(x = pirates$height, # X coordinates
y = pirates$weight, # y-coordinates
main = 'My first scatterplot of pirate data!',
xlab = 'Height (in cm)', # x-axis label
ylab = 'Weight (in kg)', # y-axis label
pch = 16, # Filled circles
col = gray(.0, .1)) # Transparent gray
grid() # Add gridlines
# Create a linear regression model
model <- lm(formula = weight ~ height,
data = pirates)
abline(model, col = 'blue') # Add regression to plot
```
Scatterplots are great for showing the relationship between two continuous variables, but what if your independent variable is not continuous? In this case, pirateplots are a good option. Let's create a pirateplot using the `pirateplot()` function to show the distribution of pirate's age based on their favorite sword:
```{r}
pirateplot( age ~ sword.type,
data = pirates,
main = "Pirateplot of ages by favorite sword")
```
Now let's make another pirateplot showing the relationship between sex and height using a different plotting theme and the `"pony"` color palette:
```{r}
pirateplot( height ~ sex, # Plot weight as a function of sex
data = pirates,
main = "Pirateplot of height by sex",
pal = "pony", # Use the info color palette
theme = 3) # Use theme 3
```
The `"pony"` palette is contained in the `piratepal()` function. Let's see where the `"pony"` palette comes from...
```{r}
# Show me the pony palette!
piratepal(palette = "pony",
plot.result = TRUE, # Plot the result
trans = .1) # Slightly transparent
```
## Hypothesis tests
Now, let's do some basic hypothesis tests. First, let's conduct a two-sample t-test to see if there is a significant difference between the ages of pirates who do wear a headband, and those who do not:
```{r}
# Age by headband t-test
t.test(age ~ headband,
data = pirates,
alternative = 'two.sided')
```
#With a p-value of 0.7259, we don't have sufficient evidence to say there is a difference in the mean age of pirates who wear headbands and those who do not.
#Next, let's test if there is a significant correlation between a pirate's height and weight using the `cor.test()` function:
```{r}
cor.test(formula = ~ height + weight,
data = pirates)
```
We got a p-value of `p < 2.2e-16`, that's scientific notation for `p < .00000000000000016` -- which is pretty much 0. Thus, we'd conclude that there is a significant (positive) relationship between a pirate's height and weight.
Now, let's do an ANOVA testing if there is a difference between the number of tattoos pirates have based on their favorite sword
```{r}
# Create tattoos model
tat.sword.lm <- lm(formula = tattoos ~ sword.type,
data = pirates)
# Get ANOVA table
anova(tat.sword.lm)
```
Sure enough, we see another very small p-value of `p < 2.2e-16`, suggesting that the number of tattoos pirates have are different based on their favorite sword.
## Regression analysis
Finally, let's run a regression analysis to see if a pirate's age, weight, and number of tattoos (s)he has predicts how many treasure chests he/she's found:
```{r}
# Create a linear regression model: DV = tchests, IV = age, weight, tattoos
tchests.model <- lm(formula = tchests ~ age + weight + tattoos,
data = pirates)
# Show summary statistics
summary(tchests.model)
```
It looks like the only significant predictor of the number of treasure chests that a pirate has found is his/her age. There does not seem to be significant effect of weight or tattoos.
## Bayesian Statistics
Now, let's repeat some of our previous analyses with Bayesian versions. First we'll install and load the \texttt{BayesFactor} package which contains the Bayesian statistics functions we'll use:
```{r eval = FALSE}
# Install and load the BayesFactor package
install.packages('BayesFactor')
library(BayesFactor)
```
Now that the package is installed and loaded, we're good to go! Let's do a Bayesian version of our earlier t-test asking if pirates who wear a headband are older or younger than those who do not.
```{r}
# Bayesian t-test comparing the age of pirates with and without headbands
ttestBF(formula = age ~ headband,
data = pirates)
```
It looks like we got a Bayes factor of 0.12 which is strong evidence *for* the null hypothesis (that the mean age does not differ between pirates with and without headbands)