title | description | layout |
---|---|---|
Data munging – Combining Multiple data source |
From messy to tidy data |
default |
- Something very important is missing! We can't actually tell the party of our Assembly Members. Let's find this information somewhere else: https://ballotpedia.org/New_York_State_Assembly
- Can you figure out how to get that table straight into R without having to scrape every row? Let's call this
bpTable
- Hint: check back into the set of rvest commands
We can now merge the two datasets. The best variable for this is going to be the district, but first we'll need to make sure it is the same in both data frames. For that, we need to
- remove the "New York State " in front of the district and
- Rename the variables
For 1), there's a nifty command str_remove()
in the stringr
package:
bpTable$Office <- bpTable$Office |> str_remove("New York State ")
For 2), there are a ton of ways to do this. The dplyr
package has a particular useful one, called rename
. While we're using this, let's give the other variable useful names, too:
bpTable <-
bpTable |> rename(district = Office,
memberSince = `Date assumed office`,
party = Party)
We can now merge the two:
fullAssembly <- merge(assembly, bpTable, by = "district")
- We have an interesting variable
memberSince
that tells us when someone joined the Assembly. Can you get the average year people joined? - Turns out we need to clean that column up for it to be useful. For this we turn to:
-
We'll use a little bit of tool called "regular expressions" or "regex" to further clean the data
-
We will (again) only cover the basics. There's a cheatsheet e.g. here: https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
-
A dot serves as a wildcard: t.p matches tip, top, tap, t3p, t-p, etc.
-
A + means "repeat one or more time". So but+er matches buter, butter, buttter, etc.
- The + operator stops matching at line breaks
- You will often see
.+
, i.e. match everything until the end of the line
-
*
works almost like +, but matches zero or more times. So but*er matches buer, buter, butter, etc. -
\d matches any digit
-
\s matches a space
-
a number in curly brackets matches n times the preceding element, so
a{4}
matches aaaa. -
parentheses around parts of a regex let you select what's matched by the parentheses separately.
-
^
denotes the beginning of a line and $ it's end.
- In R, regex are implemented in the
stringr
package. Most importantly the functionstr_match(string, "regex")
. An analogous function isstr_remove(string, "regex")
(we've already used that) which removes a pattern from a string, andstr_extract(string, "regex")
which allows us to extract one part of a string quickly.
- We'll be OK with some loss of precision and simply extract the year from the memberSince column and turn it into a new variable,
yearSince
How would you go about doing that?
What sorts of analyses could we run with this data?