From 5eb407cca6cc5ed7b3734a7ba70716866ef7e7ba Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Wed, 20 Mar 2024 01:12:56 +0000
Subject: [PATCH] Render site

---
 help.html | 8 +-
 .../Data_Classes/lab/Data_Classes_Lab.html | 16 +-
 .../Data_Cleaning/lab/Data_Cleaning_Lab.html | 25 +-
 modules/Data_Input/lab/Data_Input_Lab.html | 2 +-
 modules/Data_Output/lab/Data_Output_Lab.html | 347 ++++++++---
 .../lab/Data_Summarization_Lab.html | 13 -
 .../lab/Data_Visualization_Lab.html | 25 +-
 .../lab/Esquisse_Data_Visualization_Lab.html | 343 ++++++++---
 modules/Factors/lab/Factors_Lab.html | 371 +++++++++---
 modules/HW/homework2.html | 539 +++++++++++++++++
 modules/HW/homework3.html | 544 ++++++++++++++++++
 .../lab/Manipulating_Data_in_R_Lab.html | 4 +-
 modules/RStudio/lab/RStudio_Lab.html | 18 +-
 modules/Statistics/lab/Statistics_Lab.html | 391 ++++++++++---
 .../lab/Subsetting_Data_in_R_Lab.html | 22 +-
 15 files changed, 2272 insertions(+), 396 deletions(-)
 create mode 100644 modules/HW/homework2.html
 create mode 100644 modules/HW/homework3.html

diff --git a/help.html b/help.html
index 00f74a4e3..0e4c8acbc 100644
--- a/help.html
+++ b/help.html
@@ -405,16 +405,16 @@

Why are my changes not taking effect? It’s making my results

Here we are creating a new object from an existing one:

new_rivers <- sample(rivers, 5)
 new_rivers
-
## [1] 900 890 780 430 250
+
## [1]  350 2533  246  270  360

Using just this will only print the result and not actually change new_rivers:

new_rivers + 1
-
## [1] 901 891 781 431 251
+
## [1]  351 2534  247  271  361

If we want to modify new_rivers and save that modified version, then we need to reassign new_rivers like so:

new_rivers <- new_rivers + 1
 new_rivers
-
## [1] 901 891 781 431 251
+
## [1]  351 2534  247  271  361

If we forget to reassign, subsequent steps may not work as expected, because we will not be working with the modified data.

@@ -484,7 +484,7 @@

Error: object ‘X’ not found

operator:

rivers2 <- new_rivers + 1
 rivers2
-
## [1] 902 892 782 432 252
+
## [1]  352 2535  248  272  362

Part 1

library(readr)
 library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
-## ✔ dplyr     1.1.2     ✔ purrr     1.0.1
-## ✔ forcats   1.0.0     ✔ stringr   1.5.0
-## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
-## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
+## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
+## ✔ forcats   1.0.0     ✔ stringr   1.5.1
+## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
+## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 ## ✖ dplyr::filter() masks stats::filter()
 ## ✖ dplyr::lag()    masks stats::lag()
@@ -391,10 +391,6 @@ 

Part 1

6. Read in the Charm City Circulator data using the read_circulator() function from the jhur package with the code supplied in the chunk.

-
    -
  • Use the str() function to take a look at the data and -learn about the column types.
    -
circ <- read_circulator()
## Rows: 1146 Columns: 15
 ## ── Column specification ────────────────────────────────────────────────────────
@@ -404,6 +400,10 @@ 

Part 1

## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
    +
  • Use the str() function to take a look at the data and +learn about the column types.
    +

7. Use the mutate() function to create a new column named date_formatted that is of Date class. The new variable is created from date column. Hint: use

diff --git a/modules/Data_Cleaning/lab/Data_Cleaning_Lab.html b/modules/Data_Cleaning/lab/Data_Cleaning_Lab.html
index 3a81dfb4e..0912bc359 100644
--- a/modules/Data_Cleaning/lab/Data_Cleaning_Lab.html
+++ b/modules/Data_Cleaning/lab/Data_Cleaning_Lab.html
@@ -11,7 +11,7 @@
-Data Cleaning Lab Key
+Data Cleaning Lab
+h1.title {font-size: 38px;}
+h2 {font-size: 30px;}
+h3 {font-size: 24px;}
+h4 {font-size: 18px;}
+h5 {font-size: 16px;}
+h6 {font-size: 12px;}
+code {color: inherit; background-color: rgba(0, 0, 0, 0.04);}
+pre:not([class]) { background-color: white }
+
+
+code{white-space: pre-wrap;}
+span.smallcaps{font-variant: small-caps;}
+span.underline{text-decoration: underline;}
+div.column{display: inline-block; vertical-align: top; width: 50%;}
+div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
+ul.task-list{list-style: none;}
+

Instructions

+

Completed homework should be submitted on CoursePlus as an +Rmd file. Please see the course website for more +information about submitting assignments: https://jhudatascience.org/intro_to_r/syllabus.html#submitting-assignments.

+

Homework will be graded for correct output, not code style. All +assignments are due at the end of the course. Please see the course +website for more information about grading: https://jhudatascience.org/intro_to_r/syllabus.html#grading.

+
## you can add more, or change...these are suggestions
+library(tidyverse)
+library(readr)
+library(dplyr)
+library(ggplot2)
+library(tidyr)
+
+
+

Problem Set

+

1. Create the following two objects.

+
    +
  1. Make an object “bday”. Assign it your birthday in day-month format +(1-Jan).
    +
  2. Make another object “name”. Assign it your name. Make sure to use +quotation marks for anything with text!
    +
+

2. Make an object “me” that is “bday” and “name” combined.

+

3. Determine the data class for “me”.

+

4. If I want to do me / 2 I get the following error: +Error in me/2 : non-numeric argument to binary operator. +Why? Write your answer as a comment inside the R chunk below.

+

The following questions involve an outside +dataset.

+

We will be working with a dataset from the “Kaggle” website, which +hosts competitions for prediction and machine learning. More details on +this dataset are here: https://www.kaggle.com/c/DontGetKicked/overview/background.

+

5. Bring the dataset into R. The dataset is located at: https://jhudatascience.org/intro_to_r/data/kaggleCarAuction.csv. +You can use the link, download it, or use whatever method you like for +getting the file. Once you get the file, read the dataset in using +read_csv() and assign it the name cars.

+

6. Import the data “dictionary” from https://jhudatascience.org/intro_to_r/data/Carvana_Data_Dictionary_formatted.txt. +Use the read_tsv() function and assign it the name +“key”.

+

7. You should now be ready to work with the “cars” dataset.

+
    +
  1. Preview the data so that you can see the names of the columns. There +are several possible functions to do this.
    +
  2. Determine the class of the first three columns using +str(). Write your answer as a comment inside the R chunk +below.
    +
+

8. How many cars (rows) are in the dataset? How many variables +(columns) are recorded for each car?

+

9. Filter out (i.e., remove) any vehicles that cost less than or +equal to $5000 (“VehBCost”) or that have missing values. Replace the +original “cars” object by reassigning the new filtered dataset to +“cars”. How many vehicles are left after filtering?

+

Hint: The filter() function also +removes missing values.
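
The hint above can be illustrated with a minimal sketch on made-up data. The object `toy` and its values are assumptions for illustration only, not the homework dataset. `filter()` keeps only rows where the condition is `TRUE`, so rows where the condition evaluates to `NA` are dropped along with the `FALSE` ones:

```r
library(dplyr)

# Hypothetical toy data (not the Kaggle dataset)
toy <- tibble(id = 1:4, cost = c(4000, 6000, NA, 7500))

# Rows where `cost > 5000` is FALSE or NA are both removed
kept <- filter(toy, cost > 5000)
kept
nrow(kept)
```

Note that only missing values in the column(s) used in the condition are affected; an `NA` in some other column would not cause a row to be dropped by this call.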

+

10. From this point on, work with the filtered “cars” dataset from +the above question. Given the average car loan today is 70 months, +create a new variable (column) called “MonthlyPrice” that shows the +monthly cost for each car (Divide “VehBCost” by 70). Check to make sure +the new column is there.

+

Hint: use the mutate() function.

+

11. What is the range of the manufacture year (“VehYear”) of the +vehicles?

+

12. Create a random sample of mileage (odometer reading) from +cars. To determine the column that corresponds to mileage +(the vehicle’s odometer reading), check the “key” corresponding to the +data dictionary that you imported above in question 6. Use +sample() and pull(). Remember that by default +random samples differ each time you run the code.
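
A minimal sketch of the `pull()` plus `sample()` pattern, on made-up data. The tibble `toy`, its column `odo`, and the sample size are assumptions for illustration. `set.seed()` is not required by the question; it is shown only as one way to make a random draw repeatable:

```r
library(dplyr)

# Hypothetical toy data; the column name "odo" is an assumption
toy <- tibble(odo = c(10, 20, 30, 40, 50))

# pull() extracts the column as a plain vector, which sample() can draw from
set.seed(123)  # optional: fixes the random number stream
first_draw <- toy %>% pull(odo) %>% sample(size = 3)

set.seed(123)  # same seed, so the same draw is produced again
second_draw <- toy %>% pull(odo) %>% sample(size = 3)

identical(first_draw, second_draw)
```

Without the `set.seed()` calls, the two draws would generally differ from run to run.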

+

13. How many cars were from before 2004? What percent/proportion do +these represent? Use:

+
    +
  • filter() and nrow()
    +
  • group_by() and summarize() or
    +
  • sum()
    +
+

14. How many different vehicle manufacturers/makes (“Make”) are +there?

+

Hint: use length() with +unique() or table(). Remember to +pull() the right column.
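
As a small sketch of the hint, on made-up data (the tibble `toy` and its values are illustrative assumptions), both `unique()` and `table()` can be combined with `length()` to count distinct values once the column has been pulled out as a vector:

```r
library(dplyr)

# Hypothetical toy data (not the homework dataset)
toy <- tibble(make = c("FORD", "KIA", "FORD", "DODGE"))

# pull() returns the column as a character vector
makes <- pull(toy, make)

n_unique <- length(unique(makes))  # number of distinct values
n_table  <- length(table(makes))   # one table entry per distinct value

n_unique
n_table
```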

+

15. How many different vehicle models (“Model”) are there?

+

16. Which vehicle color group had the highest mean acquisition cost +paid for the vehicle at time of purchase, and what was this cost?

+

Hint: Use group_by() with +summarize(). To determine the column that corresponds to +“acquisition cost paid for the vehicle at time of purchase”, check the +“key” corresponding to the data dictionary that you imported above in +question 6.

+

17. Extend the code you wrote for question 16. Use the +arrange() function to sort the output by mean acquisition +cost.
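
The `group_by()`, `summarize()`, and `arrange()` chain described in questions 16 and 17 might look like the following sketch. The data, column names (`color`, `cost`), and the choice of descending order are assumptions for illustration, not the homework answer:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  color = c("RED", "RED", "BLUE"),
  cost  = c(10, 20, 40)
)

by_color <- toy %>%
  group_by(color) %>%                    # one group per color
  summarize(mean_cost = mean(cost)) %>%  # one row per group
  arrange(desc(mean_cost))               # largest mean first

by_color
```

Dropping `desc()` would sort from smallest to largest instead.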

+

18. How many vehicles were red and have fewer than 30,000 miles? To +determine the column that corresponds to mileage (the vehicle’s odometer +reading), check the “key” corresponding to the data dictionary that you +imported above in question 6. Use:

+
    +
  • filter() and count()
    +
  • filter() and tally() or
    +
  • sum()
    +
+

19. How many vehicles are blue or red? Use:

+
    +
  • filter() and count()
    +
  • filter() and tally() or
    +
  • sum()
    +
+

20. Select all columns in “cars” where the column name starts with +“Veh” (using select() and starts_with()). Then, +use colMeans() to summarize across these columns.

+
+

The following questions are not required for full credit, but can +make up for any points lost on other questions.

+
+
+
+

Bonus Practice

+

A. Using “cars”, create a new binary (TRUEs and FALSEs) column to +indicate if the car has an automatic transmission. Call the new column +“is_automatic”.

+

B. What is the average vehicle odometer reading for cars that are +both RED and NISSANs? How does this compare with vehicles that do NOT +fit these criteria?

+

C. Among red Nissans, what is the distribution of vehicle ages?

+

D. How many vehicles (using filter() or +sum() ) are made by Chrysler or Nissan and are white or +silver?

+

E. Make a boxplot (boxplot()) that looks at vehicle age +(“VehicleAge”) on the x-axis and odometer reading (“VehOdo”) on the +y-axis.

+

F. Knit your document into a report.

+

Use the Knit button to do this. Make sure all your code is +working first!

+
diff --git a/modules/HW/homework3.html b/modules/HW/homework3.html
new file mode 100644
index 000000000..d6e4d84b7
--- /dev/null
+++ b/modules/HW/homework3.html
@@ -0,0 +1,544 @@
+Introduction to R: Homework 3
+

Instructions

+

Completed homework should be submitted on CoursePlus as an +Rmd file. Please see the course website for more +information about submitting assignments: https://jhudatascience.org/intro_to_r/syllabus.html#submitting-assignments.

+

Homework will be graded for correct output, not code style. All +assignments are due at the end of the course. Please see the course +website for more information about grading: https://jhudatascience.org/intro_to_r/syllabus.html#grading.

+
## you can add more, or change...these are suggestions
+library(tidyverse)
+library(readr)
+library(dplyr)
+library(ggplot2)
+library(tidyr)
+
+
+

Problem Set

+

1. Bring the following dataset into R.

+ +

2. Run the colnames() function to take a look at the +dataset column names. You should see that there was originally no name +for the first column and that R replaced it with “…1”. Rename the first +column of “mort” to “country” using the rename() function +in dplyr.

+

3. Select only the numeric type columns (select()). +Then, create the variable “year” from column names by using the +colnames() function to extract them.

+

4. What is the typeof() for “year”? If it’s not an +integer, turn it into integer form with as.integer().

+

5. Use the pct_complete() function in the +naniar package to determine the percent missing data in +“mort”. You might need to install and load naniar!

+

6. Are there any countries that have a complete record in “mort” +across all years? Just look at the output here, don’t reassign it. +Hint: look for complete records by dropping all NAs +from the dataset using drop_na().

+

7. Reshape the “complete” data to long form.

+
    +
  • There should be a column for country (“country”), a column for year +(“year”), and a column for the mortality value (“mortality”).
    +
  • Use pivot_longer().
    +
  • You should pivot all columns except “country”.
    +
  • Hint: listing !COLUMN or +-COLUMN means everything except COLUMN.
    +
  • Assign the reshaped data to “long”.
    +
+
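
The `!COLUMN` selection described in the list above can be sketched on made-up data. The tibble `wide` and its values are assumptions for illustration, not the “mort” dataset:

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per country, one column per year
wide <- tibble(
  country = c("A", "B"),
  `2000`  = c(1.5, 2.5),
  `2001`  = c(1.0, 2.0)
)

# Pivot every column except `country`; the old column names go into "year"
long_toy <- wide %>%
  pivot_longer(!country, names_to = "year", values_to = "mortality")

long_toy
```

Because the new “year” values come from column names, they are stored as character strings until converted.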

8. Bring an additional dataset into R.

+
    +
  • The dataset is tab-delimited and located at: https://jhudatascience.org/intro_to_r/data/country_pop.txt.
    +
  • You can use the link, download it, or use whatever method you like +for getting the file.
    +
  • Once you get the file, read the dataset in using +read_tsv() and assign it the name “pop”.
    +
+

9. Rename the second column in “pop” to “country” and the column “% +of world population” to “percent”. Use the rename() +function. Don’t forget to reassign the renamed data to “pop”.

+

10. Sort the data in “pop” by “Population” from largest to smallest +using arrange() and desc(). After sorting, +select() “country” to create a one-column tibble of +countries ordered by population. Assign this data the name +“country_ordered”.

+

11. Subset “long” to the years 2000-2010 (including 2000 and 2010) +using & or the +between() function, and call the result “long_sub”. Confirm your filtering worked by +looking at the range of “year”. If you’re getting a strange error, make +sure you created the “year” column in problem #7.

+

12. Further subset long_sub. You will filter for +specific countries using filter() and the %in% +operator. Only include countries in this list: +c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Canada"). +Make sure to reassign to “long_sub”.
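
The two filtering idioms from questions 11 and 12 can be sketched on made-up data. The tibble `toy`, its values, and the country list used here are assumptions for illustration:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  country = c("Canada", "Chile", "Iran", "Peru"),
  year    = c(2000, 2005, 2011, 2010)
)

# between() is inclusive at both ends
in_range <- toy %>% filter(between(year, 2000, 2010))

# The same range written with & instead of between()
in_range_amp <- toy %>% filter(year >= 2000 & year <= 2010)

# %in% keeps rows whose value appears anywhere in the given vector
in_list <- toy %>% filter(country %in% c("Canada", "Iran"))
```

Here `in_range` and `in_range_amp` contain the same rows, which is one way to check that the two forms agree.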

+

13. Use pivot_wider() to turn the “year” column of +“long_sub” into multiple columns, each representing a different year. +Fill values (values_from=) with “mortality”. Assign this +pivoted dataset the name “mort_sub”.
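
A minimal sketch of `pivot_wider()` on made-up long data. The tibble `long_toy` and its values are assumptions for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: one row per country-year pair
long_toy <- tibble(
  country   = c("A", "A", "B", "B"),
  year      = c(2000, 2001, 2000, 2001),
  mortality = c(1.5, 1.0, 2.5, 2.0)
)

# Each distinct value of `year` becomes its own column, filled from `mortality`
wide_toy <- long_toy %>%
  pivot_wider(names_from = year, values_from = mortality)

wide_toy
```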

+

14. Using “country_ordered” and “mort_sub”, right_join() +the two datasets by “country”. Use the pipe %>% to join +this dataset to “pop”, keeping only the data on the left-hand side of the +join. Call this “joined”.
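
To make the difference between the join directions concrete, here is a small sketch on made-up tables. The objects `ordered_toy` and `values_toy` and their contents are assumptions for illustration. `right_join(x, y)` keeps every row of `y`, while `left_join(x, y)` keeps every row of `x`:

```r
library(dplyr)

# Hypothetical toy tables
ordered_toy <- tibble(country = c("B", "A", "C"))
values_toy  <- tibble(country = c("A", "B"), value = c(1, 2))

# Keeps all rows of values_toy; "C" has no match there, so it is dropped
rj <- ordered_toy %>% right_join(values_toy, by = "country")

# Keeps all rows of ordered_toy; "C" is kept with value = NA
lj <- ordered_toy %>% left_join(values_toy, by = "country")
```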

+

15. The values in the table are percentages of the total population +(not proportion).

+
    +
  • Create a new column called “mort_count” that estimates the total +number of child deaths per year based on the total population. You can +use (a) any year or (b) an average of all years to make your +calculation. Whichever you choose, justify your choice.
    +
  • Finally, select() only “country”, “Population”, and +“mort_count” and view the data.
    +
+
+

Justification is just for fun. The main point is that decisions in +your analysis should depend on your reasoning, not on how many lines of code +it takes :)

+
+
+

The following questions are not required for full credit, but can +make up for any points lost on other questions.

+
+
+
+

Bonus Practice

+

A. Bring the following dataset into R.

+
    +
  • The dataset is located at: https://jhudatascience.org/intro_to_r/data/asthma.xlsx.
    +
  • You should download the associated data.
    +
  • Once you get the file, read the dataset in using +read_excel() from the readxl package and +assign it the name “asthma”.
    +
  • Read in the sheet named “Age Group (Years)”.
    +
+

B. Rename the column Weighted Number With Current Asthma +to “asthma_count” using rename(). Replace the original +“asthma” object by calling the new dataset “asthma”.

+

C. Separate Percent (SE) into two separate columns: +“percent” and “SE” using the separate() function. Replace +the original “asthma” object by calling the new dataset “asthma”.

+

D. Remove the parentheses around the numbers in the new SE column. +You should use a combination of str_replace(), +pull() (because stringr package functions work on vectors +not dataframes!) and mutate(). Replace the original +“asthma” object by calling the new dataset “asthma”.

+
    +
  • The pattern = to find the starting parenthesis is +“[(]”
    +
  • The pattern = to find the ending parenthesis is +“[)]”
    +
  • The replacement = for both can be empty quotation +marks: “”
    +
+
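
The patterns listed above can be tried on a made-up column. The tibble `toy` and its values are assumptions for illustration. This sketch applies `str_replace()` inside `mutate()` only; the question also suggests `pull()`, which is not shown here:

```r
library(dplyr)
library(stringr)

# Hypothetical toy data with values wrapped in parentheses
toy <- tibble(SE = c("(0.5)", "(1.2)"))

toy <- toy %>%
  mutate(
    SE = str_replace(SE, pattern = "[(]", replacement = ""),  # drop "("
    SE = str_replace(SE, pattern = "[)]", replacement = "")   # drop ")"
  )

toy$SE
```

Wrapping each parenthesis in square brackets makes the regular expression treat it as a literal character rather than as grouping syntax.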

E. Determine the class of “percent” and “SE”. Can you take the mean +values? Why or why not?

+

F. Use as.numeric() to convert “percent” and “SE” to +numeric class. Calculate the mean for both.

+
diff --git a/modules/Manipulating_Data_in_R/lab/Manipulating_Data_in_R_Lab.html b/modules/Manipulating_Data_in_R/lab/Manipulating_Data_in_R_Lab.html
index 2173a918d..cc0984a9a 100644
--- a/modules/Manipulating_Data_in_R/lab/Manipulating_Data_in_R_Lab.html
+++ b/modules/Manipulating_Data_in_R/lab/Manipulating_Data_in_R_Lab.html
@@ -11,7 +11,7 @@
-Manipulating Data in R Lab - Key
+Manipulating Data in R Lab
+h1.title {font-size: 38px;}
+h2 {font-size: 30px;}
+h3 {font-size: 24px;}
+h4 {font-size: 18px;}
+h5 {font-size: 16px;}
+h6 {font-size: 12px;}
+code {color: inherit; background-color: rgba(0, 0, 0, 0.04);}
+pre:not([class]) { background-color: white }
+
+
+code{white-space: pre-wrap;}
+span.smallcaps{font-variant: small-caps;}
+span.underline{text-decoration: underline;}
+div.column{display: inline-block; vertical-align: top; width: 50%;}
+div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
+ul.task-list{list-style: none;}