diff --git a/docs/intro.md b/docs/intro.md index 0302db5..1a9e937 100644 --- a/docs/intro.md +++ b/docs/intro.md @@ -28,7 +28,7 @@ The code presented on this website will presume you have downloaded the data fro To make using the datasets easier, we provide code reorganise the `.dta` files into a simple directory structure with a folder for each sweep. This code is described under each study section (e.g., `MCS -> Creating a Simple Folder Structure`). We will assume you have organised the files in this way in other code we present. -We use the `tidyverse` (an `R` package) extensively in the code presented on this website. If you are new to the `tidyverse`, we recommend Hadley Wickham and colleagues' book, R for Data Science, which is [available for free online](https://r4ds.had.co.nz/). +We use the `tidyverse` (an `R` package) extensively in the code presented on this website. If you are new to the `tidyverse`, we recommend Hadley Wickham and colleagues' book *R for Data Science* which is [available for free online](https://r4ds.had.co.nz/). We also link to sections in this book, where relevant. # Code Sharing This website can obviously not provide all the code you may need to carry out the analyses you may want to with CLS data. We have therefore set up the [`#britishcohorts` hashtag on GitHub Gist](https://gist.github.com/search?q=%23britishcohorts) for people to share code snippets that are useful for CLS analyses. Please consider sharing your own code snippets (for instance, code to derive a useful variable) on GitHub Gist adding the `#britishcohorts` hashtag and a study specific hashtag (`#mcs`, `#bcs70`, `#nextsteps`, `#ncds`) to the Gist description to make it findable. 
diff --git a/docs/mcs-merging_across_sweeps.md b/docs/mcs-merging_across_sweeps.md index 7443aae..9ba2e2e 100644 --- a/docs/mcs-merging_across_sweeps.md +++ b/docs/mcs-merging_across_sweeps.md @@ -1,6 +1,6 @@ --- layout: default -title: Combining Data *Within* Sweeps +title: Combining Data Across Sweeps nav_order: 4 parent: MCS format: docusaurus-md @@ -11,10 +11,28 @@ format: docusaurus-md # Introduction -In this tutorial, we will learn how to merge, collapse and reshape -various data structures from a given sweep of the Millennium Cohort -Study (MCS) to create a dataset at the cohort member level (one row per -cohort member). We will use the following packages: +In this section, we show how to combine MCS data across sweeps, assuming +the data to be merged are in a consistent format (e.g., one row per +family); for information on reshaping data into a consistent structure, +see the page [*Combining Data Within a +Sweep*](https://cls-data.github.io/docs/mcs-merging_within_sweep.html). + +As an example, we use data on cohort members’ height, which was recorded +in Sweeps 2-7 and is available in the `mcs[2-7]_cm_interview.dta` files. +These files contain one row per cohort-member. As a reminder, we have +organised the data files so that each sweep [has its own folder, which +is named according to the age of +follow-up](https://cls-data.github.io/docs/mcs-sweep_folders.html) +(e.g., 3y for the second sweep). + +We begin by combining data from the second and third sweeps, showing how +to combine these datasets in **wide** (one row per observational unit) +and **long** (multiple rows per observational unit) formats by *merging* +and *appending*, respectively. We then show how to combine data from +multiple sweeps *programmatically* using the `dplyr` and `purrr` +packages (from the `tidyverse`). 
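To make the wide/long distinction concrete, here is a minimal sketch using made-up heights for three hypothetical cohort members (the identifiers, column names and values are illustrative assumptions, not MCS data):

```r
library(dplyr)

# Wide: one row per (hypothetical) cohort member, one column per sweep
wide <- tibble(
  id        = c("A", "B", "C"),
  height_3y = c(97, 96, 102),
  height_5y = c(114, 110, NA)   # "C" was not measured at age 5
)

# Long: one row per cohort member per sweep in which they were observed
long <- tibble(
  id     = c("A", "A", "B", "B", "C"),
  age    = c(3, 5, 3, 5, 3),
  height = c(97, 114, 96, 110, 102)
)

nrow(wide)  # rows = number of cohort members
nrow(long)  # rows = number of cohort member x sweep observations
```

In the wide version the missing measurement is an explicit `NA`; in the long version it is simply an absent row.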
+ +We use the following packages: ```r # Load Packages @@ -22,204 +40,629 @@ library(tidyverse) # For data manipulation library(haven) # For importing .dta files ``` -The datasets we will use are: +# Merging Across Sweeps + +The variable `[B-G]CHTCM00` contains the height of the cohort member at +Sweep 2-7, except for Sweep 5, where the variable is called `ECHTCMA0`. +[The cohort-member identifiers are stored across two +variables](https://cls-data.github.io/docs/mcs-data_structures.html) in +the `mcs[2-7]_cm_interview.dta` files: `MCSID` and `[A-G]CNUM00`. +`MCSID` is the family identifier and `[A-G]CNUM00` identifies the cohort +member within the family. We will use the `read_dta()` function from +`haven` to read in the data from the second and third sweeps, specifying +the `col_select` argument to keep only the variables we need (the two +identifier variables and height). ```r -family <- read_dta("3y/mcs2_family_derived.dta") # One row per family -cm <- read_dta("3y/mcs2_cm_derived.dta") # One row per cohort member -parent <- read_dta("3y/mcs2_parent_derived.dta") # One row per parent (responding) -parent_cm <- read_dta("3y/mcs2_parent_cm_interview.dta") # One row per parent (responding) per cohort member -hhgrid <- read_dta("3y/mcs2_hhgrid.dta") # One row per household member +df_3y <- read_dta("3y/mcs2_cm_interview.dta", + col_select = c("MCSID", "BCNUM00", "BCHTCM00")) + +df_5y <- read_dta("5y/mcs3_cm_interview.dta", + col_select = c("MCSID", "CCNUM00", "CCHTCM00")) ``` -# Data Cleaning +We can merge these datasets by row using the `*_join()` family of +functions. These share a common syntax. They take two data frames (`x` +and `y`) as arguments, as well as a `by` argument that specifies the +variable(s) to join on. 
The `*_join()` functions are: -We will create a small dataset that contains information on: family -country of residence, cohort member’s ethnicity, whether any parent -reads to the child, the warmth of the relationship between the parent -and the child, family social class (National Statistics Socio-economic -Classification; NS-SEC), and mother’s highest education level. -Constructing and combining these variables involves restructing the data -in various ways. +1. `full_join()`: Returns all rows from `x` and `y`, and all columns + from `x` and `y`. For rows without matches in both `x` and `y`, the + missing value `NA` is used for columns that are not used as + identifiers. +2. `inner_join()`: Returns all rows from `x` and `y` where there are + matching rows in both data frames. +3. `left_join()`: Returns all rows from `x`, and all columns from `x` + and `y`. Rows in `x` with no match in `y` will have `NA` values in + the new columns from `y`. +4. `right_join()`: Returns all rows from `y`, and all columns from `x` + and `y`. Rows in `y` with no match in `x` will have `NA` values in + the columns of `x`. + +In the current context, where `x` is data from the second sweep +(`df_3y`) and `y` is data from the third sweep (`df_5y`): `full_join()` +will return a row for each individual present in the second or third +sweeps, with the height from each sweep in the same row; `inner_join()` +will return a row for each individual who was present in both sweeps, +with the height from each sweep in the same row; `left_join()` will +return a row for each individual in the second sweep, with the height +from the third sweep in the same row if the individual was present in +the third sweep; `right_join()` will return a row for each individual in +the third sweep, with the height from the second sweep in the same row +if the individual was present in the second sweep. 
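The difference between the four joins can be checked on a toy example. The sketch below assumes two small made-up data frames, `x` and `y`, that share a single hypothetical identifier `id`; only the row counts are compared:

```r
library(dplyr)

# Hypothetical data: ids 1-3 seen at the earlier sweep, ids 2-4 at the later one
x <- tibble(id = c(1, 2, 3), height_3y = c(97, 96, 102))
y <- tibble(id = c(2, 3, 4), height_5y = c(110, 118, 121))

n_full  <- nrow(full_join(x, y, by = "id"))   # everyone in x or y
n_inner <- nrow(inner_join(x, y, by = "id"))  # only those in both x and y
n_left  <- nrow(left_join(x, y, by = "id"))   # everyone in x
n_right <- nrow(right_join(x, y, by = "id"))  # everyone in y

c(full = n_full, inner = n_inner, left = n_left, right = n_right)
```

Comparing these counts against `nrow(x)` and `nrow(y)` is a quick sanity check that a join has kept the rows you intended.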
+ +The `*_join()` functions can handle multiple variables to join on, and +can also handle situations where the identifiers have different names +across `x` and `y`. To specify the identifiers, we pass a vector to the +`by` argument. In this case, we pass a *named vector* so that `BCNUM00` +in `df_3y` can be matched to `CCNUM00` in `df_5y`. + +```r +df_3y %>% + full_join(df_5y, by = c("MCSID", BCNUM00 = "CCNUM00")) +``` -We will begin with the simplest variables: cohort member’s ethnicity and -family country of residence. Cohort member’s ethnicity is stored in a -cohort-member level dataset already (`mcs2_cm_derived`), so it does not -need further processing. +``` text +# A tibble: 17,242 × 4 + MCSID BCNUM00 BCHTCM00 CCHTCM00 + + 1 M10001N 1 [1st Cohort Member of the family] 97 114. + 2 M10002P 1 [1st Cohort Member of the family] 96 110. + 3 M10007U 1 [1st Cohort Member of the family] 102 118 + 4 M10008V 1 [1st Cohort Member of the family] -2 [No Measurement tak… NA + 5 M10008V 2 [2nd Cohort Member of the family] -2 [No Measurement tak… NA + 6 M10011Q 1 [1st Cohort Member of the family] 106 121 + 7 M10014T 1 [1st Cohort Member of the family] 97 NA + 8 M10015U 1 [1st Cohort Member of the family] 94 110. + 9 M10016V 1 [1st Cohort Member of the family] 102 118. +10 M10017W 1 [1st Cohort Member of the family] 99 110. +# ℹ 17,232 more rows +``` ```r -df_ethnic_group <- cm %>% - select(MCSID, BCNUM00, ethnic_group = BDC08E00) +df_3y %>% + inner_join(df_5y, by = c("MCSID", BCNUM00 = "CCNUM00")) +``` -df_ethnic_group +``` text +# A tibble: 13,967 × 4 + MCSID BCNUM00 BCHTCM00 CCHTCM00 + + 1 M10001N 1 [1st Cohort Member of the family] 97 114. + 2 M10002P 1 [1st Cohort Member of the family] 96 110. + 3 M10007U 1 [1st Cohort Member of the family] 102 118 + 4 M10011Q 1 [1st Cohort Member of the family] 106 121 + 5 M10015U 1 [1st Cohort Member of the family] 94 110. + 6 M10016V 1 [1st Cohort Member of the family] 102 118. + 7 M10017W 1 [1st Cohort Member of the family] 99 110. 
+ 8 M10018X 1 [1st Cohort Member of the family] 97 113. + 9 M10020R 1 [1st Cohort Member of the family] 97 112. +10 M10021S 1 [1st Cohort Member of the family] 90 108 +# ℹ 13,957 more rows +``` + +```r +df_3y %>% + left_join(df_5y, by = c("MCSID", BCNUM00 = "CCNUM00")) ``` ``` text -# A tibble: 15,778 × 3 - MCSID BCNUM00 ethnic_group - - 1 M10001N 1 [1st Cohort Member of the family] 1 [White] - 2 M10002P 1 [1st Cohort Member of the family] 1 [White] - 3 M10007U 1 [1st Cohort Member of the family] 1 [White] - 4 M10008V 1 [1st Cohort Member of the family] 1 [White] - 5 M10008V 2 [2nd Cohort Member of the family] 1 [White] - 6 M10011Q 1 [1st Cohort Member of the family] 2 [Mixed] - 7 M10014T 1 [1st Cohort Member of the family] 1 [White] - 8 M10015U 1 [1st Cohort Member of the family] 1 [White] - 9 M10016V 1 [1st Cohort Member of the family] 1 [White] -10 M10017W 1 [1st Cohort Member of the family] -1 [Not applicable] +# A tibble: 15,778 × 4 + MCSID BCNUM00 BCHTCM00 CCHTCM00 + + 1 M10001N 1 [1st Cohort Member of the family] 97 114. + 2 M10002P 1 [1st Cohort Member of the family] 96 110. + 3 M10007U 1 [1st Cohort Member of the family] 102 118 + 4 M10008V 1 [1st Cohort Member of the family] -2 [No Measurement tak… NA + 5 M10008V 2 [2nd Cohort Member of the family] -2 [No Measurement tak… NA + 6 M10011Q 1 [1st Cohort Member of the family] 106 121 + 7 M10014T 1 [1st Cohort Member of the family] 97 NA + 8 M10015U 1 [1st Cohort Member of the family] 94 110. + 9 M10016V 1 [1st Cohort Member of the family] 102 118. +10 M10017W 1 [1st Cohort Member of the family] 99 110. +# ℹ 15,768 more rows +``` + +```r +df_3y %>% + right_join(df_5y, by = c("MCSID", BCNUM00 = "CCNUM00")) +``` + +``` text +# A tibble: 15,431 × 4 + MCSID BCNUM00 BCHTCM00 CCHTCM00 + + 1 M10001N 1 [1st Cohort Member of the family] 97 114. + 2 M10002P 1 [1st Cohort Member of the family] 96 110. 
+ 3 M10007U 1 [1st Cohort Member of the family] 102 118 + 4 M10011Q 1 [1st Cohort Member of the family] 106 121 + 5 M10015U 1 [1st Cohort Member of the family] 94 110. + 6 M10016V 1 [1st Cohort Member of the family] 102 118. + 7 M10017W 1 [1st Cohort Member of the family] 99 110. + 8 M10018X 1 [1st Cohort Member of the family] 97 113. + 9 M10020R 1 [1st Cohort Member of the family] 97 112. +10 M10021S 1 [1st Cohort Member of the family] 90 108 +# ℹ 15,421 more rows +``` + +Note, the `*_join()` functions will merge any matching rows. Unlike +`Stata`, we do not have to explicitly state whether we want a 1-to-1, +many-to-1, 1-to-many, or many-to-many merge. This is determined by the +data that are inputted to `*_join()`. `R` will not throw an error if +matches are not 1-to-1, so care must be taken, for instance, when +merging the different data structures. See [*Combining Data Within a +Sweep*](https://cls-data.github.io/docs/mcs-merging_within_sweep.html) +for more information. + +# Appending Sweeps + +To put the data into long format, we can use the `bind_rows()` function. +(In this case, the data will have one row per cohort-member x sweep +combination.) To work properly, we need to name the variables +consistently across sweeps, which here means removing the sweep-specific +prefixes (e.g., the letter `B` from `BCNUM00` and `BCHTCM00` in +`df_3y`). We also need to add a variable to identify the sweep the data +comes from. Below, we use the `mutate()` function to create a `sweep` +variable and then use the `rename_with()` function to remove the +prefixes and rename the variables consistently across sweeps. + +```r +df_3y_noprefix <- df_3y %>% + mutate(sweep = 2, .before = 1) %>% + rename_with(~ str_remove(.x, "^B")) + +df_5y_noprefix <- df_5y %>% + mutate(sweep = 3, .before = 1) %>% + rename_with(~ str_remove(.x, "^C")) +``` + +`rename_with()` applies a function to the names of the variables. 
In +this case, we use the `str_remove()` function from the `stringr` package +(part of the `tidyverse`) to remove the prefix from the variable names. +The `~` symbol is used to create an [*anonymous +function*](https://r4ds.hadley.nz/iteration.html), which is applied to +each variable name. The `.x` symbol in the anonymous function is a +placeholder for the variable name. `str_remove()` takes a regular +expression. The `^` symbol is used to match the start of the string (so +`^C` removes the `C` where it is the first character in a variable +name - necessary to avoid removing the `C` within, e.g., `MCSID`). Note, +for the `mutate()` call, the `.before` argument is used to specify the +position of the new variable in the data frame - here we specify `sweep` +as the first column. Below we see what the formatted data frames look +like: + +```r +df_3y_noprefix +``` + +``` text +# A tibble: 15,778 × 4 + sweep MCSID CNUM00 CHTCM00 + + 1 2 M10001N 1 [1st Cohort Member of the family] 97 + 2 2 M10002P 1 [1st Cohort Member of the family] 96 + 3 2 M10007U 1 [1st Cohort Member of the family] 102 + 4 2 M10008V 1 [1st Cohort Member of the family] -2 [No Measurement taken] + 5 2 M10008V 2 [2nd Cohort Member of the family] -2 [No Measurement taken] + 6 2 M10011Q 1 [1st Cohort Member of the family] 106 + 7 2 M10014T 1 [1st Cohort Member of the family] 97 + 8 2 M10015U 1 [1st Cohort Member of the family] 94 + 9 2 M10016V 1 [1st Cohort Member of the family] 102 +10 2 M10017W 1 [1st Cohort Member of the family] 99 # ℹ 15,768 more rows ``` -Family country of residence is stored in a family-level dataset -(`mcs2_family_derived`). This also does not need any further processing -at this stage. Later when we merge this with `df_ethnic_group`, we will -perform a 1-to-many merge, so the data will be automatically repeated -for cases where there are multiple cohort members in a family. 
- -```r -df_country <- family %>% - select(MCSID, country = BACTRY00) -``` - -Next, we will create a variable that indicates whether *any* parent -reads to the cohort member We will use data from the -`mcs2_parent_cm_interview` dataset, which contains a variable for the -parent’s reading habit (`BPOFRE00`). We first create a binary variable -that indicates whether the parent reads to the cohort member at least -once a week, and then create a summary variable indicating whether any -(interviewed) parent reads (`max(parent_reads)`) using `summarise()` -with `group_by(MCSID, BCNUM00)` to ensure this is calculated per cohort -member. The result is a dataset with one row per cohort member with data -on whether any parent reads to them. - -```r -df_reads <- parent_cm %>% - select(MCSID, BPNUM00, BCNUM00, BPOFRE00) %>% - mutate(parent_reads = case_when(between(BPOFRE00, 1, 3) ~ 1, - between(BPOFRE00, 4, 6) ~ 0)) %>% - drop_na() %>% - group_by(MCSID, BCNUM00) %>% - summarise(parent_reads = max(parent_reads), - .groups = "drop") - -df_reads -``` - -``` text -# A tibble: 15,684 × 3 - MCSID BCNUM00 parent_reads - - 1 M10001N 1 [1st Cohort Member of the family] 1 - 2 M10002P 1 [1st Cohort Member of the family] 1 - 3 M10007U 1 [1st Cohort Member of the family] 1 - 4 M10008V 1 [1st Cohort Member of the family] 1 - 5 M10008V 2 [2nd Cohort Member of the family] 1 - 6 M10011Q 1 [1st Cohort Member of the family] 1 - 7 M10014T 1 [1st Cohort Member of the family] 1 - 8 M10015U 1 [1st Cohort Member of the family] 1 - 9 M10016V 1 [1st Cohort Member of the family] 1 -10 M10017W 1 [1st Cohort Member of the family] 1 -# ℹ 15,674 more rows -``` - -We next create two separate variables for whether the responding parent -has a warm relationship with the cohort member (`BPPIAW00`) again using -the `mcs2_parent_cm_interview` dataset. 
As the data have one row per -parent-cohort member combination, we need to create a variable -indicating which parent is which, and then reshape the warmth variable -from long to wide (one row per cohort member) using `pivot_wider()`. -Again, result is a dataset with one row per cohort member with data on -their relationship with each carer. - -```r -df_warm <- parent_cm %>% - select(MCSID, BCNUM00, BELIG00, BPPIAW00) %>% - mutate(variable = ifelse(BELIG00 == 1, "main_warm", "secondary_warm"), - value = case_when(BPPIAW00 == 5 ~ 1, - between(BPPIAW00, 1, 6) ~ 0)) %>% - select(MCSID, BCNUM00, variable, value) %>% - pivot_wider(names_from = variable, values_from = value) -``` - -Next, we want to create a variable for family social class (NS-SEC) -using the `mcs2_parent_derived` dataset. This is a parent level dataset, -and we will use the parent’s NS-SEC (`BDD05S00`) and take the minimum -value for each family (lower values of `BDD05S00` indicate higher social -class). - -```r -df_nssec <- parent %>% - select(MCSID, BPNUM00, parent_nssec = BDD05S00) %>% - mutate(parent_nssec = if_else(parent_nssec < 0, NA, parent_nssec)) %>% - drop_na() %>% - group_by(MCSID) %>% - summarise(family_nssec = min(parent_nssec)) -``` - -We will also create a variable for the mother’s highest education level -using the `mcs2_parent_derived` dataset. We will filter for mothers only -(`BHCREL00 == 7` \[Natural Parent\] and `BHPSEX00 == 2` \[Female\]) and -select the variable highest education level (`BDDNVQ00`). We will then -merge these two variables with the other variables we have created so -far. We use `right_join()`, which gives a row for every mother in the -dataset, regardless of whether they have education data (`right_join()` -fills variables with `NA` where not observed). 
- -```r -df_mother <- hhgrid %>% - select(MCSID, BPNUM00, BHCREL00, BHPSEX00) %>% - filter(between(BPNUM00, 1, 99), - BHCREL00 == 7, - BHPSEX00 == 2) %>% - distinct(MCSID, BPNUM00) %>% - add_count(MCSID) %>% - filter(n == 1) %>% - select(MCSID, BPNUM00) - -df_mother_edu <- parent %>% - select(MCSID, BPNUM00, mother_nvq = BDDNVQ00) %>% - right_join(df_mother, by = c("MCSID", "BPNUM00")) %>% - select(-BPNUM00) -``` - -# Merging the Datasets - -Now we have cleaned each variable, we can merge them together. The -cleaned datasets are either at the family level (`df_country`, -`df_nssec`, `df_mother_edu`) or cohort member level (`df_ethnic_group`, -`df_reads`, `df_warm`). We begin with `df_ethnic_group` as this has all -the cohort members (participating at Sweep 2) in it, and then use -`left_join()` so these rows are kept (and no more are added). To merge -with a family-level dataset, we use `left_join(..., by = "MCSID")` as -`MCSID` is the unique identifier for each cohort member. For the cohort -member level datasets, we use -`left_join(..., by = c("MCSID", "BCNUM00"))` as the combination of -`MCSID` and `BCNUM00` uniquely identifies cohort members. 
- -```r -df_ethnic_group %>% - left_join(df_country, by = "MCSID") %>% - left_join(df_reads, by = c("MCSID", "BCNUM00")) %>% - left_join(df_warm, by = c("MCSID", "BCNUM00")) %>% - left_join(df_nssec, by = "MCSID") %>% - left_join(df_mother_edu, by = "MCSID") -``` - -``` text -# A tibble: 15,778 × 9 - MCSID BCNUM00 ethnic_group country parent_reads main_warm secondary_warm - - 1 M10001N 1 [1st Co… 1 [White] 2 [Wal… 1 NA NA - 2 M10002P 1 [1st Co… 1 [White] 2 [Wal… 1 1 1 - 3 M10007U 1 [1st Co… 1 [White] 2 [Wal… 1 0 1 - 4 M10008V 1 [1st Co… 1 [White] 1 [Eng… 1 1 1 - 5 M10008V 2 [2nd Co… 1 [White] 1 [Eng… 1 1 1 - 6 M10011Q 1 [1st Co… 2 [Mixed] 1 [Eng… 1 1 NA - 7 M10014T 1 [1st Co… 1 [White] 3 [Sco… 1 1 1 - 8 M10015U 1 [1st Co… 1 [White] 1 [Eng… 1 1 1 - 9 M10016V 1 [1st Co… 1 [White] 4 [Nor… 1 1 1 -10 M10017W 1 [1st Co… -1 [Not app… 1 [Eng… 1 NA NA +```r +df_5y_noprefix +``` + +``` text +# A tibble: 15,431 × 4 + sweep MCSID CNUM00 CHTCM00 + + 1 3 M10001N 1 [1st Cohort Member of the family] 114. + 2 3 M10002P 1 [1st Cohort Member of the family] 110. + 3 3 M10007U 1 [1st Cohort Member of the family] 118 + 4 3 M10011Q 1 [1st Cohort Member of the family] 121 + 5 3 M10015U 1 [1st Cohort Member of the family] 110. + 6 3 M10016V 1 [1st Cohort Member of the family] 118. + 7 3 M10017W 1 [1st Cohort Member of the family] 110. + 8 3 M10018X 1 [1st Cohort Member of the family] 113. + 9 3 M10020R 1 [1st Cohort Member of the family] 112. +10 3 M10021S 1 [1st Cohort Member of the family] 108 +# ℹ 15,421 more rows +``` + +Now the data have been prepared, we can use `bind_rows()` to append the +data frames together. This will stack the data frames on top of each +other, so the number of rows is equal to the sum of rows in the +individual datasets. The `bind_rows()` function can handle data frames +with different numbers of columns. Missing columns are filled with `NA` +values. 
+ +```r +bind_rows(df_3y_noprefix, df_5y_noprefix) %>% + arrange(MCSID, CNUM00, sweep) # Sorts the dataset by ID and sweep +``` + +``` text +Warning: `..1$CHTCM00` and `..2$CHTCM00` have conflicting value labels. +ℹ Labels for these values will be taken from `..1$CHTCM00`. +✖ Values: -1 +``` + +``` text +# A tibble: 31,209 × 4 + sweep MCSID CNUM00 CHTCM00 + + 1 2 M10001N 1 [1st Cohort Member of the family] 97 + 2 3 M10001N 1 [1st Cohort Member of the family] 114. + 3 2 M10002P 1 [1st Cohort Member of the family] 96 + 4 3 M10002P 1 [1st Cohort Member of the family] 110. + 5 2 M10007U 1 [1st Cohort Member of the family] 102 + 6 3 M10007U 1 [1st Cohort Member of the family] 118 + 7 2 M10008V 1 [1st Cohort Member of the family] -2 [No Measurement taken] + 8 2 M10008V 2 [2nd Cohort Member of the family] -2 [No Measurement taken] + 9 2 M10011Q 1 [1st Cohort Member of the family] 106 +10 3 M10011Q 1 [1st Cohort Member of the family] 121 +# ℹ 31,199 more rows +``` + +Notice that with `bind_rows()` a cohort member has only as many rows of +data as the number of times they appeared in Sweeps 2 and 3. This differs from +`*_join()`, where an explicit missing `NA` value is generated for the +missing sweep. The `tidyverse` function `complete()` [can be used to +create missing +rows](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit), +which can be useful if you need to generate a balanced panel of +observations from which to begin analysis (e.g., when performing +multiple imputation in long format). + +```r +bind_rows(df_3y_noprefix, df_5y_noprefix) %>% + complete(sweep, MCSID, CNUM00) %>% # Ensure cohort members have a row for each sweep + arrange(MCSID, CNUM00, sweep) +``` + +``` text +Warning: `..1$CHTCM00` and `..2$CHTCM00` have conflicting value labels. +ℹ Labels for these values will be taken from `..1$CHTCM00`. 
+✖ Values: -1 +``` + +``` text +# A tibble: 68,092 × 4 + sweep MCSID CNUM00 CHTCM00 + + 1 2 M10001N 1 [1st Cohort Member of the family] 97 + 2 3 M10001N 1 [1st Cohort Member of the family] 114. + 3 2 M10001N 2 [2nd Cohort Member of the family] NA + 4 3 M10001N 2 [2nd Cohort Member of the family] NA + 5 2 M10002P 1 [1st Cohort Member of the family] 96 + 6 3 M10002P 1 [1st Cohort Member of the family] 110. + 7 2 M10002P 2 [2nd Cohort Member of the family] NA + 8 3 M10002P 2 [2nd Cohort Member of the family] NA + 9 2 M10007U 1 [1st Cohort Member of the family] 102 +10 3 M10007U 1 [1st Cohort Member of the family] 118 +# ℹ 68,082 more rows +``` + +# Combining Sweeps Programmatically + +Combining sweeps manually can become tedious when more than two sweeps +need to be combined. Instead, [iterative +programming](https://r4ds.hadley.nz/iteration) can be used to automate the +process. Below we show how to merge and append multiple sweeps together +with very little code using the `purrr` package (part of the +`tidyverse`). + +## Merging Programmatically + +Before merging the datasets together, we need to load the data for each +sweep. We can do this by creating a function, `load_height_wide()`, +which takes a single argument `sweep` and loads the height data for that +sweep. The function uses the `glue()` function from the `glue` package +to create the file path. We create and subset a vector of follow-up ages +(`fups`) to identify the correct folder to obtain the +`mcs{sweep}_cm_interview.dta` file from. The `glue()` function is used +to create strings from `R` objects. The curly braces (`{}`) act as +placeholders for variables or function calls that are computed when the +string is evaluated - e.g., when `sweep = 1`, +`{fup}y/mcs{sweep}_cm_interview.dta` = `0y/mcs1_cm_interview.dta`. +(`fup` is determined by subsetting the relevant element in the vector +`fups`.) `glue` is part of the `tidyverse`, but is not a *core* package, +so needs to be loaded explicitly. 
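The path construction can be tried on its own, without reading any data. The sketch below reuses the `fups` vector of follow-up ages; `make_path()` is a hypothetical helper name introduced here for illustration only:

```r
library(glue)

fups <- c(0, 3, 5, 7, 11, 14, 17)  # follow-up age (years) for sweeps 1-7

# Hypothetical helper: build the file path for a given sweep number
make_path <- function(sweep) {
  fup <- fups[sweep]
  glue("{fup}y/mcs{sweep}_cm_interview.dta")
}

# Paths for the second and third sweeps, as plain character strings
paths <- vapply(2:3, function(s) as.character(make_path(s)), character(1))
paths
```

Printing the paths before passing them to `read_dta()` is a cheap way to confirm that the folder and file names line up with the directory structure on disk.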
+ +The file path is piped to the `read_dta()` function to read in the data, +with the `col_select` argument used to keep only the variables we need. +Note we use a regular expression to select the `CNUM00` and height +variables as these have slightly different names each sweep. Typically, +variable names differ only in the sweep prefix used (`BCHTCM00`, +`CCHTCM00`), but in Sweep 5 (age 11y), the name of the height variable +(`ECHTCMA0`) diverged slightly from this pattern. Below, we also +include a step to `rename()` the `[B-G]CNUM00` variable to `cnum` to +ensure consistency across sweeps as this will make merging more +straightforward later. + +```r +library(glue) +fups <- c(0, 3, 5, 7, 11, 14, 17) + +load_height_wide <- function(sweep){ + fup <- fups[sweep] + prefix <- LETTERS[sweep] + + glue("{fup}y/mcs{sweep}_cm_interview.dta") %>% + read_dta(col_select = c("MCSID", matches("^.(CNUM00|CHTCM(A|0)0)"))) %>% + rename(cnum = matches("CNUM00")) +} +``` + +To confirm the function is working correctly, below we call it twice to +load data from the second and third sweeps. + +```r +load_height_wide(2) +``` + +``` text +# A tibble: 15,778 × 3 + MCSID cnum BCHTCM00 + + 1 M10001N 1 [1st Cohort Member of the family] 97 + 2 M10002P 1 [1st Cohort Member of the family] 96 + 3 M10007U 1 [1st Cohort Member of the family] 102 + 4 M10008V 1 [1st Cohort Member of the family] -2 [No Measurement taken] + 5 M10008V 2 [2nd Cohort Member of the family] -2 [No Measurement taken] + 6 M10011Q 1 [1st Cohort Member of the family] 106 + 7 M10014T 1 [1st Cohort Member of the family] 97 + 8 M10015U 1 [1st Cohort Member of the family] 94 + 9 M10016V 1 [1st Cohort Member of the family] 102 +10 M10017W 1 [1st Cohort Member of the family] 99 # ℹ 15,768 more rows -# ℹ 2 more variables: family_nssec , mother_nvq +``` + +```r +load_height_wide(3) +``` + +``` text +# A tibble: 15,431 × 3 + MCSID cnum CCHTCM00 + + 1 M10001N 1 [1st Cohort Member of the family] 114. 
+ 2 M10002P 1 [1st Cohort Member of the family] 110. + 3 M10007U 1 [1st Cohort Member of the family] 118 + 4 M10011Q 1 [1st Cohort Member of the family] 121 + 5 M10015U 1 [1st Cohort Member of the family] 110. + 6 M10016V 1 [1st Cohort Member of the family] 118. + 7 M10017W 1 [1st Cohort Member of the family] 110. + 8 M10018X 1 [1st Cohort Member of the family] 113. + 9 M10020R 1 [1st Cohort Member of the family] 112. +10 M10021S 1 [1st Cohort Member of the family] 108 +# ℹ 15,421 more rows +``` + +We could manually load and merge successively using multiple +`load_height_wide()` and `full_join()` function calls. However, this is +rather verbose: + +```r +load_height_wide(2) %>% + full_join(load_height_wide(3), by = c("MCSID", "cnum")) %>% + full_join(load_height_wide(4), by = c("MCSID", "cnum")) %>% + full_join(load_height_wide(6), by = c("MCSID", "cnum")) %>% + full_join(load_height_wide(7), by = c("MCSID", "cnum")) +``` + +``` text +# A tibble: 17,568 × 7 + MCSID cnum BCHTCM00 CCHTCM00 DCHTCM00 FCHTCM00 GCHTCM00 + + 1 M10001N 1 [1st Cohort Member o… 97 114. 128. NA NA + 2 M10002P 1 [1st Cohort Member o… 96 110. 123 163. 174. + 3 M10007U 1 [1st Cohort Member o… 102 118 129 174. 181. + 4 M10008V 1 [1st Cohort Member o… -2 [No … NA NA NA NA + 5 M10008V 2 [2nd Cohort Member o… -2 [No … NA NA NA NA + 6 M10011Q 1 [1st Cohort Member o… 106 121 137 NA NA + 7 M10014T 1 [1st Cohort Member o… 97 NA NA NA NA + 8 M10015U 1 [1st Cohort Member o… 94 110. 122. 164. 169 + 9 M10016V 1 [1st Cohort Member o… 102 118. 130 167 185. +10 M10017W 1 [1st Cohort Member o… 99 110. 121. NA NA +# ℹ 17,558 more rows +``` + +More efficiently, we can use the `map()` function from the `purrr` +package (part of the `tidyverse`) to apply the `load_height_wide()` +function to each sweep in turn. The `map()` function takes an object to +be looped over as its first argument and a function to apply as its +second argument. 
The function can be written as an anonymous function, +similar to `rename_with()`. `.x` is a placeholder for the current +element of the object being looped over (in this case: `2`, then `3`, +then `4`, …, then `7`). The `map()` function returns the results as a +`list`. (Variants of `map()` return other data types, as we see +shortly.) Below we use `map()` to run `load_height_wide()` for sweeps +2-7. To save space, we do not print the output. + +```r +map(2:7, ~ load_height_wide(.x)) +``` + +To merge the list of datasets returned by `map()` together, we can use the +`reduce()` function from `purrr`. `reduce()` has a similar syntax to +`map()`: it takes an object as its first argument, and a function as its +second argument. It applies the function to the first *two* elements of +the list, and then successively applies the function to the result and +the third element of the list onwards, until the list is finished. +Below, we use `reduce()` to apply the `full_join()` function to the list +of data frames. We specify `full_join()` in an anonymous function. `.x` +and `.y` are the first and second inputs, respectively. In this case, at +the first iteration sweep 2 (`.x`) is merged with sweep 3 (`.y`), and at +the second iteration, the result of the first iteration (`.x`) is merged +with sweep 4 (`.y`). This process is repeated until sweep 7 has been +merged in. + +```r +map(2:7, load_height_wide) %>% + reduce(~ full_join(.x, .y, by = c("MCSID", "cnum"))) +``` + +``` text +# A tibble: 17,614 × 8 + MCSID cnum BCHTCM00 CCHTCM00 DCHTCM00 ECHTCMA0 FCHTCM00 GCHTCM00 + + 1 M10001N 1 [1st Cohort… 97 114. 128. NA NA NA + 2 M10002P 1 [1st Cohort… 96 110. 123 144. 163. 174. + 3 M10007U 1 [1st Cohort… 102 118 129 154. 174. 181. + 4 M10008V 1 [1st Cohort… -2 [No … NA NA NA NA NA + 5 M10008V 2 [2nd Cohort… -2 [No … NA NA NA NA NA + 6 M10011Q 1 [1st Cohort… 106 121 137 168. NA NA + 7 M10014T 1 [1st Cohort… 97 NA NA NA NA NA + 8 M10015U 1 [1st Cohort… 94 110. 122. 143 164. 
169 + 9 M10016V 1 [1st Cohort… 102 118. 130 152. 167 185. +10 M10017W 1 [1st Cohort… 99 110. 121. NA NA NA +# ℹ 17,604 more rows +``` + +## Appending Programmatically + +Programmatically appending datasets together is slightly more +straightforward as we can use a variant of `map()` called `map_dfr()`. +Instead of returning a list, `map_dfr()` returns a data frame by calling +`bind_rows()` in the background at the end. First, we create a function, +`load_height_long()`, to load the height data from a given sweep and +format it so that it can be appended to the other sweeps (i.e., giving +variables consistent names). As above, the `rename_with()` function +renames the variables to remove the sweep-specific prefixes. The +relevant prefix is determined by subsetting the inbuilt `LETTERS` +vector, which contains the letters of the alphabet in upper case +(`"A"`, `"B"`, `"C"`, …, `"Z"`; i.e., `LETTERS[2]` returns `"B"`). + +```r +load_height_long <- function(sweep){ + fup <- fups[sweep] + prefix <- LETTERS[sweep] + + glue("{fup}y/mcs{sweep}_cm_interview.dta") %>% + read_dta(col_select = c("MCSID", matches("^.(CNUM00|CHTCM(A|0)0)"))) %>% + rename_with(~ str_replace(.x, glue("^{prefix}"), "")) %>% + mutate(sweep = !!sweep, .before = 1) +} +``` + +To load data from sweeps 2-7 and append them together, we can use +`map_dfr()` with the `load_height_long()` function. + +```r +map_dfr(2:7, ~ load_height_long(.x)) +``` + +``` text +Warning: `..1$CHTCM00` and `..2$CHTCM00` have conflicting value labels. +ℹ Labels for these values will be taken from `..1$CHTCM00`. +✖ Values: -1 +``` + +``` text +Warning: `..1$CHTCM00` and `..3$CHTCM00` have conflicting value labels. +ℹ Labels for these values will be taken from `..1$CHTCM00`. +✖ Values: -8 and -1 +``` + +``` text +Warning: `..1$CHTCM00` and `..5$CHTCM00` have conflicting value labels. +ℹ Labels for these values will be taken from `..1$CHTCM00`. 
+✖ Values: -1 +``` + +``` text +Warning: `..1$CHTCM00` and `..6$CHTCM00` have conflicting value labels. +ℹ Labels for these values will be taken from `..1$CHTCM00`. +✖ Values: -5 and -1 +``` + +``` text +# A tibble: 80,873 × 5 + sweep MCSID CNUM00 CHTCM00 CHTCMA0 + + 1 2 M10001N 1 [1st Cohort Member of the family] 97 NA + 2 2 M10002P 1 [1st Cohort Member of the family] 96 NA + 3 2 M10007U 1 [1st Cohort Member of the family] 102 NA + 4 2 M10008V 1 [1st Cohort Member of the family] -2 [No Measuremen… NA + 5 2 M10008V 2 [2nd Cohort Member of the family] -2 [No Measuremen… NA + 6 2 M10011Q 1 [1st Cohort Member of the family] 106 NA + 7 2 M10014T 1 [1st Cohort Member of the family] 97 NA + 8 2 M10015U 1 [1st Cohort Member of the family] 94 NA + 9 2 M10016V 1 [1st Cohort Member of the family] 102 NA +10 2 M10017W 1 [1st Cohort Member of the family] 99 NA +# ℹ 80,863 more rows +``` + +# Coda: Merging Parent Level Files + +As discussed in the [Data Structures +page](https://cls-data.github.io/docs/mcs-data_structures.html), the +`mcs[1-7]_parent_*.dta` files contain identifiers for the respondent +(`MCSID` and `[A-G]PNUM00`), but also for the type of interview they +completed (`MCSID` and `[A-G]ELIG00`). We can use either of these to +merge parent-level datasets together across sweeps. When doing so, it is +sometimes worth keeping the information on the other identifiers to +retain information on the respondent or interview type; for instance, +this may help to determine why a response to the same survey item differed +markedly between sweeps. 
+ +```r +df_parent_5y <- read_dta("5y/mcs3_parent_cm_interview.dta", + col_select = c("MCSID", "CCNUM00", "CPNUM00", "CELIG00", "CPFRTP00")) + +df_parent_7y <- read_dta("7y/mcs4_parent_cm_interview.dta", + col_select = c("MCSID", "DCNUM00", "DPNUM00", "DELIG00", "DPFRTP00")) + +df_parent_5y %>% + full_join(df_parent_7y, + by = c("MCSID", + "CCNUM00" = "DCNUM00", + "CPNUM00" = "DPNUM00")) # Merge by person +``` + +``` text +# A tibble: 27,861 × 7 + MCSID CPNUM00 CELIG00 CCNUM00 CPFRTP00 DELIG00 DPFRTP00 + + 1 M10001N 1 1 [Main Interview] 1 [1st Coh… 2 [Two] 1 [Mai… 2 [Two] + 2 M10002P 1 1 [Main Interview] 1 [1st Coh… 3 [Thr… 1 [Mai… 3 [Thre… + 3 M10002P 2 2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 [Par… 3 [Thre… + 4 M10007U 1 1 [Main Interview] 1 [1st Coh… 3 [Thr… 1 [Mai… 3 [Thre… + 5 M10007U 2 2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 [Par… 3 [Thre… + 6 M10011Q 1 1 [Main Interview] 1 [1st Coh… 3 [Thr… 1 [Mai… 3 [Thre… + 7 M10011Q 2 2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 [Par… 3 [Thre… + 8 M10015U 1 1 [Main Interview] 1 [1st Coh… 2 [Two] 1 [Mai… 1 [One] + 9 M10015U 2 2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 [Par… 1 [One] +10 M10016V 1 1 [Main Interview] 1 [1st Coh… 2 [Two] 1 [Mai… 2 [Two] +# ℹ 27,851 more rows +``` + +```r +df_parent_5y %>% + full_join(df_parent_7y, + by = c("MCSID", + "CCNUM00" = "DCNUM00", + "CELIG00" = "DELIG00")) # Merge by interview type +``` + +``` text +# A tibble: 27,770 × 7 + MCSID CPNUM00 CELIG00 CCNUM00 CPFRTP00 DPNUM00 DPFRTP00 + + 1 M10001N 1 1 [Main Interview] 1 [1st Coh… 2 [Two] 1 2 [Two] + 2 M10002P 1 1 [Main Interview] 1 [1st Coh… 3 [Thr… 1 3 [Thre… + 3 M10002P 2 2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 3 [Thre… + 4 M10007U 1 1 [Main Interview] 1 [1st Coh… 3 [Thr… 1 3 [Thre… + 5 M10007U 2 2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 3 [Thre… + 6 M10011Q 1 1 [Main Interview] 1 [1st Coh… 3 [Thr… 1 3 [Thre… + 7 M10011Q 2 2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 3 [Thre… + 8 M10015U 1 1 [Main Interview] 1 [1st 
Coh… 2 [Two] 1 1 [One] + 9 M10015U 2 2 [Partner Interview] 1 [1st Coh… -1 [Not… 2 1 [One] +10 M10016V 1 1 [Main Interview] 1 [1st Coh… 2 [Two] 1 2 [Two] +# ℹ 27,760 more rows ``` diff --git a/docs/mcs-merging_within_sweep.md b/docs/mcs-merging_within_sweep.md index c201ac7..14f9a82 100644 --- a/docs/mcs-merging_within_sweep.md +++ b/docs/mcs-merging_within_sweep.md @@ -1,7 +1,7 @@ --- layout: default -title: Combining Data Across Sweeps -nav_order: 4 +title: Combining Data Within a Sweep +nav_order: 5 parent: MCS format: docusaurus-md --- @@ -11,10 +11,14 @@ format: docusaurus-md # Introduction -In this tutorial, we will learn how to merge, collapse and reshape -various data structures from a given sweep of the Millennium Cohort -Study (MCS) to create a dataset at the cohort member level (one row per -cohort member). We will use the following packages: +In this section, we show how to merge, collapse and reshape the various +data structures within a given sweep of the Millennium Cohort Study +(MCS) to create a dataset at the cohort member-level (i.e., one row per +cohort member). This is likely to be the most useful data structure for +most analyses, but similar principles can be applied to create other +structures as needed (e.g., family-level datasets). + +We use the following packages: ```r # Load Packages @@ -22,7 +26,18 @@ library(tidyverse) # For data manipulation library(haven) # For importing .dta files ``` -The datasets we will use are: +# Data Cleaning + +We create a small dataset that contains information from Sweep 2 on: +family country of residence, cohort member’s ethnicity, whether any +parent reads to the child, the warmth of the relationship between the +parent and the child, family social class (National Statistics +Socio-economic Classification; NS-SEC), and mother’s highest education +level. 
Constructing and combining these variables involves restructuring +the data in various ways, and spans the most common data engineering +tasks involved in bringing together information from a single sweep. + +The datasets we use in this tutorial are: ```r family <- read_dta("3y/mcs2_family_derived.dta") # One row per family @@ -32,24 +47,16 @@ parent_cm <- read_dta("3y/mcs2_parent_cm_interview.dta") # One row per parent (r hhgrid <- read_dta("3y/mcs2_hhgrid.dta") # One row per household member ``` -# Data Cleaning - -We will create a small dataset that contains information on: family -country of residence, cohort member’s ethnicity, whether any parent -reads to the child, the warmth of the relationship between the parent -and the child, family social class (National Statistics Socio-economic -Classification; NS-SEC), and mother’s highest education level. -Constructing and combining these variables involves restructing the data -in various ways. - -We will begin with the simplest variables: cohort member’s ethnicity and +We begin with the simplest variables: cohort member’s ethnicity and family country of residence. Cohort member’s ethnicity is stored in a cohort-member level dataset already (`mcs2_cm_derived`), so it does not -need further processing. +need further processing. Below we rename the relevant variable and +select it along with the cohort member identifiers, `MCSID` and +`BCNUM00`. ```r df_ethnic_group <- cm %>% - select(MCSID, BCNUM00, ethnic_group = BDC08E00) + select(MCSID, BCNUM00, ethnic_group = BDC08E00) # Retains the listed variables, renaming BDC08E00 as ethnic_group df_ethnic_group ``` @@ -73,8 +80,8 @@ df_ethnic_group Family country of residence is stored in a family-level dataset (`mcs2_family_derived`). This also does not need any further processing -at this stage. +at this stage. 
Later, when we merge these data with `df_ethnic_group`, +we perform a 1-to-many merge, so the data will be automatically repeated for cases where there are multiple cohort members in a family. ```r @@ -82,26 +89,29 @@ df_country <- family %>% select(MCSID, country = BACTRY00) ``` -Next, we will create a variable that indicates whether *any* parent -reads to the cohort member We will use data from the -`mcs2_parent_cm_interview` dataset, which contains a variable for the -parent’s reading habit (`BPOFRE00`). We first create a binary variable -that indicates whether the parent reads to the cohort member at least -once a week, and then create a summary variable indicating whether any -(interviewed) parent reads (`max(parent_reads)`) using `summarise()` -with `group_by(MCSID, BCNUM00)` to ensure this is calculated per cohort +Next, we create a variable that indicates whether *any* parent reads to +the cohort member; in other words, we create a summary variable using +data from individual parents. The `mcs2_parent_cm_interview` dataset +contains a variable for the parent’s reading habit to a given child +(`BPOFRE00`). We first create a binary variable that indicates whether +the parent reads to the cohort member at least once a week, and then +create a summary variable indicating whether any (interviewed) parent +reads (`max(parent_reads)`) by collapsing the data using `summarise()` +on a [grouped data +frame](https://r4ds.hadley.nz/data-transform.html#groups) +(`group_by(MCSID, BCNUM00)`) to ensure this is calculated per cohort member. The result is a dataset with one row per cohort member with data -on whether any parent reads to them. 
+on whether any parent reads to them.[^1] ```r df_reads <- parent_cm %>% select(MCSID, BPNUM00, BCNUM00, BPOFRE00) %>% - mutate(parent_reads = case_when(between(BPOFRE00, 1, 3) ~ 1, + mutate(parent_reads = case_when(between(BPOFRE00, 1, 3) ~ 1, # Create binary variable for reading habit between(BPOFRE00, 4, 6) ~ 0)) %>% - drop_na() %>% - group_by(MCSID, BCNUM00) %>% - summarise(parent_reads = max(parent_reads), - .groups = "drop") + drop_na() %>% # Drops rows with any missing value (ensures we get a value where at least 1 parent gave a valid response). + group_by(MCSID, BCNUM00) %>% # Groups the data so summarise() is performed per cohort member. + summarise(parent_reads = max(parent_reads), # Calculates maximum value per cohort member + .groups = "drop") # Removes the grouping from the resulting dataframe. df_reads ``` @@ -123,48 +133,59 @@ df_reads # ℹ 15,674 more rows ``` -We next create two separate variables for whether the responding parent -has a warm relationship with the cohort member (`BPPIAW00`) again using -the `mcs2_parent_cm_interview` dataset. As the data have one row per -parent-cohort member combination, we need to create a variable -indicating which parent is which, and then reshape the warmth variable -from long to wide (one row per cohort member) using `pivot_wider()`. -Again, result is a dataset with one row per cohort member with data on -their relationship with each carer. +We next show a different way of using `mcs2_parent_cm_interview`, +reshaping the data from long to wide so that we have one row per cohort +member. As an example, we create separate variables for whether the +responding parent has a warm relationship with the cohort member +(`BPPIAW00`), one using responses from the main carer and one from the +secondary carer. 
As `mcs2_parent_cm_interview` has one row per parent x +cohort member combination, we first create a variable indicating which +parent is which (using information from `BELIG00`), and then reshape the +warmth variable from [long to wide using +`pivot_wider()`](https://r4ds.hadley.nz/data-tidy.html#widening-data). ```r df_warm <- parent_cm %>% select(MCSID, BCNUM00, BELIG00, BPPIAW00) %>% - mutate(variable = ifelse(BELIG00 == 1, "main_warm", "secondary_warm"), - value = case_when(BPPIAW00 == 5 ~ 1, + mutate(var_name = ifelse(BELIG00 == 1, "main_warm", "secondary_warm"), + warmth = case_when(BPPIAW00 == 5 ~ 1, between(BPPIAW00, 1, 6) ~ 0)) %>% - select(MCSID, BCNUM00, variable, value) %>% - pivot_wider(names_from = variable, values_from = value) + select(MCSID, BCNUM00, var_name, warmth) %>% + pivot_wider(names_from = var_name, # The new variables are named main_warm and secondary_warm... + values_from = warmth) # ... and take values from "warmth" ``` -Next, we want to create a variable for family social class (NS-SEC) -using the `mcs2_parent_derived` dataset. This is a parent level dataset, -and we will use the parent’s NS-SEC (`BDD05S00`) and take the minimum -value for each family (lower values of `BDD05S00` indicate higher social -class). +Next, we show an example of creating family level data using data from +individual parents; in this case, a variable for family social class +(NS-SEC) using data from `mcs2_parent_derived`. As `mcs2_parent_derived` +is a parent level dataset, we take the minimum of parents’ NS-SEC +(`BDD05S00`) within a family (lower values of `BDD05S00` indicate higher +social class). ```r df_nssec <- parent %>% select(MCSID, BPNUM00, parent_nssec = BDD05S00) %>% - mutate(parent_nssec = if_else(parent_nssec < 0, NA, parent_nssec)) %>% + mutate(parent_nssec = if_else(parent_nssec < 0, NA, parent_nssec)) %>% # Negative values denote various forms of missingness. 
drop_na() %>% group_by(MCSID) %>% summarise(family_nssec = min(parent_nssec)) ``` -We will also create a variable for the mother’s highest education level -using the `mcs2_parent_derived` dataset. We will filter for mothers only -(`BHCREL00 == 7` \[Natural Parent\] and `BHPSEX00 == 2` \[Female\]) and -select the variable highest education level (`BDDNVQ00`). We will then -merge these two variables with the other variables we have created so -far. We use `right_join()`, which gives a row for every mother in the -dataset, regardless of whether they have education data (`right_join()` -fills variables with `NA` where not observed). +Finally, we create a variable for the mother’s highest education level +using the `mcs2_parent_derived` dataset. This involves merging in +relationship information from the household grid and subsetting the rows +so we are left with data for mothers only (see [*Working with the +Household +Grid*](https://cls-data.github.io/docs/mcs-household_grid.html)). We +separately filter the household grid for mothers only (`BHCREL00 == 7` +\[Natural Parent\] and `BHPSEX00 == 2` \[Female\]) and select the +highest education level variable (`BDDNVQ00`) from the +`mcs2_parent_derived` dataset. 
We then merge the datasets together using +`right_join()`, which, in this case, gives a row for every mother in the +dataset, regardless of whether they have education data or not +(`right_join()` fills variables with `NA` where [the retained row does +not have a +match](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit)).[^2] ```r df_mother <- hhgrid %>% @@ -177,8 +198,10 @@ df_mother <- hhgrid %>% filter(n == 1) %>% select(MCSID, BPNUM00) -df_mother_edu <- parent %>% - select(MCSID, BPNUM00, mother_nvq = BDDNVQ00) %>% +df_parent_edu <- parent %>% + select(MCSID, BPNUM00, mother_nvq = BDDNVQ00) + +df_mother_edu <- df_parent_edu %>% right_join(df_mother, by = c("MCSID", "BPNUM00")) %>% select(-BPNUM00) ``` @@ -189,11 +212,11 @@ Now we have cleaned each variable, we can merge them together. The cleaned datasets are either at the family level (`df_country`, `df_nssec`, `df_mother_edu`) or cohort member level (`df_ethnic_group`, `df_reads`, `df_warm`). We begin with `df_ethnic_group` as this has all -the cohort members (participating at Sweep 2) in it, and then use -`left_join()` so these rows are kept (and no more are added). To merge -with a family-level dataset, we use `left_join(..., by = "MCSID")` as -`MCSID` is the unique identifier for each cohort member. For the cohort -member level datasets, we use +the cohort members participating at Sweep 2 in it. We then use +`left_join()` to merge in other data so the original rows are kept (and no +more are added). To merge with a family-level dataset, we use +`left_join(..., by = "MCSID")` as `MCSID` is the unique identifier for +each family. For the cohort member level datasets, we use `left_join(..., by = c("MCSID", "BCNUM00"))` as the combination of `MCSID` and `BCNUM00` uniquely identifies cohort members. 
@@ -223,3 +246,17 @@ df_ethnic_group %>% # ℹ 15,768 more rows # ℹ 2 more variables: family_nssec , mother_nvq ``` + +# Footnotes + +[^1]: Below, for simplicity, we drop any rows with missing values + (`drop_na()` step). Proper analyses may opt to use a different rule, + which may require merging in other information (e.g., setting the + value to missing unless all resident parents have been interviewed + and provided a valid response). + +[^2]: More detail on merging with `right_join()` (and other `*_join()` + variants) is provided in [*Combining Data Across + Sweeps*](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html), + as well as [Chapter 19 of the R for Data Science + textbook](https://r4ds.hadley.nz/joins.html#sec-mutating-joins). diff --git a/quarto/mcs-merging_across_sweeps.qmd b/quarto/mcs-merging_across_sweeps.qmd index fc0c3f4..2722650 100644 --- a/quarto/mcs-merging_across_sweeps.qmd +++ b/quarto/mcs-merging_across_sweeps.qmd @@ -1,6 +1,6 @@ --- layout: default -title: "Combining Data *Across* Sweeps" +title: "Combining Data Across Sweeps" nav_order: 4 parent: MCS format: docusaurus-md @@ -8,11 +8,11 @@ format: docusaurus-md # Introduction -In this section, we show how to combine MCS data across sweeps, assuming the data to be merged are in a consistent format (e.g., one row per family); for information on munging data to have a consistent structure see the page [*Combining Data Within a Sweep*](https://cls-data.github.io/docs/mcs-merging_within_sweep.html) +In this section, we show how to combine MCS data across sweeps, assuming the data to be merged are in a consistent format (e.g., one row per family); for information on munging data to have a consistent structure see the page [*Combining Data Within a Sweep*](https://cls-data.github.io/docs/mcs-merging_within_sweep.html) -To demonstrate, we will use data on cohort members' height, which was recorded in Sweeps 2-7 and is available in the `mcs[2-7]_cm_interview.dta` files. 
These files contain one row per cohort-member. As a reminder, we have organised the data files so that each sweep [has its own folder, which is named according to the age of follow-up](https://cls-data.github.io/docs/mcs-sweep_folders.html) (e.g., 3y for the second sweep). +As an example, we use data on cohort members' height, which was recorded in Sweeps 2-7 and is available in the `mcs[2-7]_cm_interview.dta` files. These files contain one row per cohort-member. As a reminder, we have organised the data files so that each sweep [has its own folder, which is named according to the age of follow-up](https://cls-data.github.io/docs/mcs-sweep_folders.html) (e.g., 3y for the second sweep). -We begin by combining data from the second and third sweeps, showing how to combine these datasets in **wide** (one row per observational unit) and **long** (multiple rows per observational unit) formats by *merging* and *appending*, respectively. We will show how to combine data from multiple sweeps *programmatically* using the `dplyr` and `purrr` packages (from the `tidyverse`). +We begin by combining data from the second and third sweeps, showing how to combine these datasets in **wide** (one row per observational unit) and **long** (multiple rows per observational unit) formats by *merging* and *appending*, respectively. We then show how to combine data from multiple sweeps *programmatically* using the `dplyr` and `purrr` packages (from the `tidyverse`). We use the following packages: @@ -30,7 +30,7 @@ library(haven) # For importing .dta files # Merging Across Sweeps -The variable `[B-G]CHTCM00` contains the height of the cohort member at each sweep, except for Sweep 5, where the variable is called `ECHTCMA0`. [The cohort-member identifiers are stored across two variables](https://cls-data.github.io/docs/mcs-data_structures.html) in the `mcs[2-7]_cm_interview.dta` files: `MCSID` and `[A-G]CNUM00`. 
`MCSID` is the family identifier and `[A-G]CNUM00` identifies the cohort member within the family. We will use the `read_dta()` function from `haven` to read in the data from the second and third sweeps, specifying the `col_select` argument to keep only the variables we need. +The variable `[B-G]CHTCM00` contains the height of the cohort member at Sweep 2-7, except for Sweep 5, where the variable is called `ECHTCMA0`. [The cohort-member identifiers are stored across two variables](https://cls-data.github.io/docs/mcs-data_structures.html) in the `mcs[2-7]_cm_interview.dta` files: `MCSID` and `[A-G]CNUM00`. `MCSID` is the family identifier and `[A-G]CNUM00` identifies the cohort member within the family. We will use the `read_dta()` function from `haven` to read in the data from the second and third sweeps, specifying the `col_select` argument to keep only the variables we need (the two identifier variables and height). ```{r} df_3y <- read_dta("3y/mcs2_cm_interview.dta", @@ -40,7 +40,7 @@ df_5y <- read_dta("5y/mcs3_cm_interview.dta", col_select = c("MCSID", "CCNUM00", "CCHTCM00")) ``` -We can merge the data using the `*_join()` family of functions. These share a common syntax. They take two data frames (`x` and `y`) as arguments, as well as a `by` argument that specifies the variable(s) to join on. The `*_join()` functions are: +We can merge these datasets by row using the `*_join()` family of functions. These share a common syntax. They take two data frames (`x` and `y`) as arguments, as well as a `by` argument that specifies the variable(s) to join on. The `*_join()` functions are: 1. `full_join()`: Returns all rows from `x` and `y`, and all columns from `x` and `y`. For rows without matches in both `x` and `y`, the missing value `NA` is used for columns that are not used as identifiers. 2. `inner_join()`: Returns all rows from `x` and `y` where there are matching rows in both data frames. 
@@ -65,41 +65,53 @@ df_3y %>% right_join(df_5y, by = c("MCSID", BCNUM00 = "CCNUM00")) ``` +Note, the `*_join()` functions will merge any matching rows. Unlike `Stata`, we do not have to explicitly state whether we want a 1-to-1, many-to-1, 1-to-many, or many-to-many merge. This is determined by the data that are inputted to `*_join()`. `R` will not throw an error if matches are not 1-to-1, so care must be taken, for instance, when merging the different data structures. See [*Combining Data Within a Sweep*](https://cls-data.github.io/docs/mcs-merging_within_sweep.html) for more information. + # Appending Sweeps -To put the data into long format, we can use the `bind_rows()` function. To work properly, we need to name the variables consistently across sweeps, which in this case means removing the sweep-specific prefixes (i.e., the letter `B` from `df_3y` and the letter `C` from `df_5y`). We also need to add a variable to identify the sweep the data comes from. Below, we use the `mutate()` function to create a `sweep` variable and then use the `rename_with()` function to remove the prefixes and rename the variables consistently across sweeps. +To put the data into long format, we can use the `bind_rows()` function. (In this case, the data will have one row per cohort-member x sweep combination.) To work properly, we need to name the variables consistently across sweeps, which here means removing the sweep-specific prefixes (e.g., the letter `B` from `BCNUM00` and `BCHTCM00` in `df_3y`). We also need to add a variable to identify the sweep the data comes from. Below, we use the `mutate()` function to create a `sweep` variable and then use the `rename_with()` function to remove the prefixes and rename the variables consistently across sweeps. 
```{r} -df_3y_nopre <- df_3y %>% +df_3y_noprefix <- df_3y %>% mutate(sweep = 2, .before = 1) %>% rename_with(~ str_remove(.x, "^B")) -df_5y_nopre <- df_5y %>% +df_5y_noprefix <- df_5y %>% mutate(sweep = 3, .before = 1) %>% rename_with(~ str_remove(.x, "^C")) ``` -`rename_with()` applies a function to the names of the variables. In this case, we use the `str_remove()` function from the `stringr` package (part of the `tidyverse`) to remove the prefix from the variable names. The `~` symbol is used to create an *anonymous function*, which is applied to each variable name. The `.x` symbol in the anonymous function is a placeholder for the variable name. `str_remove()` takes a regular expression. The `^` symbol is used to match the start of the string (so `^C` removes the `C` where it is the first character in a variable name - this is necessary to avoid removing the `C` within, e.g., `MCSID`). Note, for the `mutate()` call, the `.before` argument is used to specify the position of the new variable in the data frame - here we want it as the first column. Below we see what the formatted data frames look like: +`rename_with()` applies a function to the names of the variables. In this case, we use the `str_remove()` function from the `stringr` package (part of the `tidyverse`) to remove the prefix from the variable names. The `~` symbol is used to create an [*anonymous function*](https://r4ds.hadley.nz/iteration.html), which is applied to each variable name. The `.x` symbol in the anonymous function is a placeholder for the variable name. `str_remove()` takes a regular expression. The `^` symbol is used to match the start of the string (so `^C` removes the `C` where it is the first character in a variable name - necessary to avoid removing the `C` within, e.g., `MCSID`). Note, for the `mutate()` call, the `.before` argument is used to specify the position of the new variable in the data frame - here we specify `sweep` as the first column. 
Below we see what the formatted data frames look like: ```{r} -df_3y_nopre -df_5y_nopre +df_3y_noprefix +df_5y_noprefix ``` Now the data have been prepared, we can use `bind_rows()` to append the data frames together. This will stack the data frames on top of each other, so the number of rows is equal to the sum of rows in the individual datasets. The `bind_rows()` function can handle data frames with different numbers of columns. Missing columns are filled with `NA` values. ```{r} -bind_rows(df_3y_nopre, df_5y_nopre) +bind_rows(df_3y_noprefix, df_5y_noprefix) %>% + arrange(MCSID, CNUM00, sweep) # Sorts the dataset by ID and sweep +``` + +Notice that with `bind_rows()` a cohort member has only as many rows of data as the times they appeared in Sweeps 2 and 3. This differs from `*_join()` where an explicit missing `NA` value is generated for the missing sweep. The `tidyverse` function `complete()` [can be used to create missing rows](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit), which can be useful if you need to generate a balanced panel of observations from which to begin analysis (e.g., when performing multiple imputation in long format). + +```{r} +bind_rows(df_3y_noprefix, df_5y_noprefix) %>% + complete(sweep, MCSID, CNUM00) %>% # Ensure cohort members have a row for each sweep + arrange(MCSID, CNUM00, sweep) ``` # Combing Sweeps Programatically -Combining sweeps manually can become tedious when you need to combine more than two sweeps together. Instead, [iterative programming](https://r4ds.hadley.nz/iteration) can be used automate the process. Below we show how to merge and append multiple sweeps together with very little code using the `purrr` package (part of the `tidyverse`). +Combining sweeps manually can become tedious when more than two sweeps need to be combined. Instead, [iterative programming](https://r4ds.hadley.nz/iteration) can be used to automate the process. 
Below we show how to merge and append multiple sweeps together with very little code using the `purrr` package (part of the `tidyverse`). ## Merging Programmatically -Before merging the datasets together, we need to load the data for each sweep. We can do this by creating a function, `load_height_wide()`, which takes a single argument `sweep` and loads the height data for that sweep. The function uses the `glue()` function from the `glue` package to create the file path. We create and subset a vector of follow-up ages (`fups`) to identify the correct folder to obtain the `mcs{sweep}_cm_interview.dta` file from. The `glue()` function is used to create strings from `R` objects. The curly braces (`{}`) act as placeholders for variables or function calls that are computed when the string is evaluated - e.g., when `sweep = 1`, `{fup}y/mcs{sweep}_cm_interview.dta` = `0y/mcs1_cm_interview.dta`. (`fup` is determined by subsetting the relevant element in the vectors `fups`.) `glue` is part of the `tidyverse`, but is not a *core* package, so needs to be loaded explicitly. -The file path is fed to the `read_dta()` function from the `haven` package to read in the data, with the `col_select` argument used to keep only the variables we need. Note we use a regular expression to select the `CNUM` and height variables as these have slightly different names each sweep. Typically variable names only differ on the sweep prefix used (`ACHTM00`, `BCHTM00`), but in Sweep 5 (age 11y), the name of the height variable (`ECHTCMA00`) diverges slightly from this pattern. Below, we also include a step to `rename()` the `[B-G]CNUM00` variable to `cnum` to ensure consistency across sweeps as this will make merging more straightforward later. +Before merging the datasets together, we need to load the data for each sweep. We can do this by creating a function, `load_height_wide()`, which takes a single argument `sweep` and loads the height data for that sweep. 
The function uses the `glue()` function from the `glue` package to create the file path. We create and subset a vector of follow-up ages (`fups`) to identify the correct folder to obtain the `mcs{sweep}_cm_interview.dta` file from. The `glue()` function is used to create strings from `R` objects. The curly braces (`{}`) act as placeholders for variables or function calls that are computed when the string is evaluated - e.g., when `sweep = 1`, `{fup}y/mcs{sweep}_cm_interview.dta` = `0y/mcs1_cm_interview.dta`. (`fup` is determined by subsetting the relevant element in the vector `fups`.) `glue` is part of the `tidyverse`, but is not a *core* package, so needs to be loaded explicitly. + +The file path is piped to the `read_dta()` function to read in the data, with the `col_select` argument used to keep only the variables we need. Note we use a regular expression to select the `CNUM00` and height variables as these have slightly different names in each sweep. Typically variable names only differ on the sweep prefix used (`ACHTM00`, `BCHTM00`), but in Sweep 5 (age 11y), the name of the height variable (`ECHTCMA0`) diverged slightly from this pattern. Below, we also include a step to `rename()` the `[B-G]CNUM00` variable to `cnum` to ensure consistency across sweeps as this will make merging more straightforward later. ```{r} library(glue) @@ -115,14 +127,14 @@ load_height_wide <- function(sweep){ } ``` -To confirm the function is working correctly, let's use it to load the data the second and third sweeps. +To confirm the function is working correctly, below we call it twice to load data from the second and third sweeps. ```{r} load_height_wide(2) load_height_wide(3) ``` -Now, we could manually load and merge successively using multiple `load_height_wide()` and `full_join()` function calls. However, this is rather verbose. +We could manually load and merge successively using multiple `load_height_wide()` and `full_join()` function calls. 
However, this is rather verbose: ```{r} load_height_wide(2) %>% @@ -132,14 +144,14 @@ load_height_wide(2) %>% full_join(load_height_wide(7), by = c("MCSID", "cnum")) ``` -More efficiently, we can use the `map()` function from the `purrr` package (part of the `tidyverse`) to apply the `load_height_wide()` function to each sweep in turn. The `map()` function takes an object to be looped over as its first argument and a function to apply as its second argument. The function can be written as an anonymous function, similar to `rename_with()`. `.x` is a placeholder for the current elements of the object being looped over. The `map()` function returns the results as a `list`. (Variants of `map()` return other data types, as we will see shortly). Below we use `map()` to run `load_height_wide()` for sweeps 2-7. To save space, we do not print the output. +More efficiently, we can use the `map()` function from the `purrr` package (part of the `tidyverse`) to apply the `load_height_wide()` function to each sweep in turn. The `map()` function takes an object to be looped over as its first argument and a function to apply as its second argument. The function can be written as an anonymous function, similar to `rename_with()`. `.x` is a placeholder for the current elements of the object being looped over (in this case: `2`, then `3`, then `4`, ..., then `7`). The `map()` function returns the results as a `list`. (Variants of `map()` return other data types, as we see shortly). Below we use `map()` to run `load_height_wide()` for sweeps 2-7. To save space, we do not print the output. ```{r} #| output: false map(2:7, ~ load_height_wide(.x)) ``` -To merge list of datasets returned by `map()` together, we can use the `reduce()` function from `purrr` package. `reduce()` has a similar syntax to `map()`: it takes an object as its first argument, and a function as its second argument. 
It applies the function to the first *two* elements of the list, and then progressively applies the function to the result and the next element of the list, until the list is finished. Below, we use `reduce()` to apply the `full_join()` function to the list of data frames. We specify `full_join()` in an anonymous function. `.x` and `.y` the first and second inputs, respectively. So, at the first iteration sweep 2 (`.x`) is merged with sweep 3 (`.y`), and at the second iteration, the result of the first iteration (`.x`) is merged with sweep 4 (`.y`). This is repeated until sweep 7 has been merged in. +To merge the list of datasets returned by `map()` together, we can use the `reduce()` function from `purrr`. `reduce()` has a similar syntax to `map()`: it takes an object as its first argument, and a function as its second argument. It applies the function to the first *two* elements of the list, and then successively applies the function to the result and the third element of the list onwards, until the list is finished. Below, we use `reduce()` to apply the `full_join()` function to the list of data frames. We specify `full_join()` in an anonymous function. `.x` and `.y` are the first and second inputs, respectively. In this case, at the first iteration sweep 2 (`.x`) is merged with sweep 3 (`.y`), and at the second iteration, the result of the first iteration (`.x`) is merged with sweep 4 (`.y`). This process is repeated until sweep 7 has been merged in. ```{r} map(2:7, load_height_wide) %>% @@ -147,7 +159,8 @@ map(2:7, load_height_wide) %>% ``` ## Appending Programmatically -Programatically appending datasets together is slightly more straightforward as we can use a variant of `map()` called `map_dfr()` which instead of returning a list, returns a data frame by calling `bind_rows()` on the result in the background. First, we create a function, `load_height_long()`, to load the height data a given sweep, formatting it so that it can be appended to the other sweeps. 
The `rename_with()` function renames the variables to remove the sweep-specific prefixes. The relevant prefix is determined by subsetting the inbuilt `LETTERS` vectors, which contains the letters of the alphabet in upper case (`"A"`, `"B"`, `"C"`, ..., `"Z"`; i.e., `LETTERS[2]` returns `"B"`). + +Programmatically appending datasets together is slightly more straightforward as we can use a variant of `map()` called `map_dfr()`. Instead of returning a list, `map_dfr()` returns a data frame by calling `bind_rows()` in the background at the end. First, we create a function, `load_height_long()`, to load the height data from a given sweep and format it so that it can be appended to the other sweeps (i.e., giving variables consistent names). As above, the `rename_with()` function renames the variables to remove the sweep-specific prefixes. The relevant prefix is determined by subsetting the inbuilt `LETTERS` vector, which contains the letters of the alphabet in upper case (`"A"`, `"B"`, `"C"`, ..., `"Z"`; e.g., `LETTERS[2]` returns `"B"`). ```{r} load_height_long <- function(sweep){ @@ -161,16 +174,15 @@ load_height_long <- function(sweep){ } ``` - -To load data from sweeps 2-7 and append them together, we can use `map_dfr()` with the `load_height_long()` function. Note, if we just provide the name of the function to `map_dfr()` (and `map()`, `reduce()`, `rename_with()`, etc.), the current element of the object being looped over is inputted as the first argument to that function. (We could also have done this above, but anonymous functions are extremely useful when writing complex code and arguably clarify the action that is being done.) +To load data from sweeps 2-7 and append them together, we can use `map_dfr()` with the `load_height_long()` function. 
```{r} -map_dfr(2:7, load_height_long) +map_dfr(2:7, ~ load_height_long(.x)) ``` # Coda: Merging Parent Level Files -As discussed in the [Data Structures page](https://cls-data.github.io/docs/mcs-data_structures.html), the `mcs[1-7]_parent_*.dta` files contain identifiers for the respondent (`MCSID` and `[A-G]PNUM00`), but also for the type of interview they completed (`MCSID` and `[A-G]ELIG00`). We can use either of these to merge parent-level datasets together across sweeps. When doing so, it is sometimes worth keep the information on the other identifiers to retain information on the respondent or interview; for instance, this may help to determine why a variable was missing for an individual in a particular sweep. +As discussed in the [Data Structures page](https://cls-data.github.io/docs/mcs-data_structures.html), the `mcs[1-7]_parent_*.dta` files contain identifiers for the respondent (`MCSID` and `[A-G]PNUM00`), but also for the type of interview they completed (`MCSID` and `[A-G]ELIG00`). We can use either of these to merge parent-level datasets together across sweeps. When doing so, it is sometimes worth keeping the information on the other identifiers to retain information on the respondent or interview type; for instance, this may help to determine why a response to the same survey item differed markedly between sweeps. 
```{r} df_parent_5y <- read_dta("5y/mcs3_parent_cm_interview.dta", @@ -190,4 +202,4 @@ df_parent_5y %>% by = c("MCSID", "CCNUM00" = "DCNUM00", "CELIG00" = "DELIG00")) # Merge by interview type -``` \ No newline at end of file +``` diff --git a/quarto/mcs-merging_within_sweep.qmd b/quarto/mcs-merging_within_sweep.qmd index bf24050..49951f3 100644 --- a/quarto/mcs-merging_within_sweep.qmd +++ b/quarto/mcs-merging_within_sweep.qmd @@ -1,14 +1,16 @@ --- layout: default -title: "Combining Data *Within* a Sweep" -nav_order: 4 +title: "Combining Data Within A Sweep" +nav_order: 5 parent: MCS format: docusaurus-md --- # Introduction -In this tutorial, we will learn how to merge, collapse and reshape various data structures from a given sweep of the Millennium Cohort Study (MCS) to create a dataset at the cohort member level (one row per cohort member). We will use the following packages: +In this section, we show how to merge, collapse and reshape the various data structures within a given sweep of the Millennium Cohort Study (MCS) to create a dataset at the cohort member-level (i.e., one row per cohort member). This is likely to be the most useful data structure for most analyses, but similar principles can be applied to create other structures as needed (e.g., family-level datasets). + +We use the following packages: ```{r} #| warning: false @@ -22,7 +24,11 @@ library(haven) # For importing .dta files # setwd(Sys.getenv("mcs_fld")) ``` -The datasets we will use are: +# Data Cleaning + +We create a small dataset that contains information from Sweep 2 on: family country of residence, cohort member's ethnicity, whether any parent reads to the child, the warmth of the relationship between the parent and the child, family social class (National Statistics Socio-economic Classification; NS-SEC), and mother's highest education level. 
Constructing and combining these variables involves restructuring the data in various ways, and spans the most common data engineering tasks involved in bringing together information from a single sweep. + +The datasets we use in this tutorial are: ```{r} family <- read_dta("3y/mcs2_family_derived.dta") # One row per family @@ -32,67 +38,67 @@ parent_cm <- read_dta("3y/mcs2_parent_cm_interview.dta") # One row per parent (r hhgrid <- read_dta("3y/mcs2_hhgrid.dta") # One row per household member ``` -# Data Cleaning -We will create a small dataset that contains information on: family country of residence, cohort member's ethnicity, whether any parent reads to the child, the warmth of the relationship between the parent and the child, family social class (National Statistics Socio-economic Classification; NS-SEC), and mother's highest education level. Constructing and combining these variables involves restructing the data in various ways. - -We will begin with the simplest variables: cohort member's ethnicity and family country of residence. Cohort member's ethnicity is stored in a cohort-member level dataset already (`mcs2_cm_derived`), so it does not need further processing. +We begin with the simplest variables: cohort member's ethnicity and family country of residence. Cohort member's ethnicity is stored in a cohort-member level dataset already (`mcs2_cm_derived`), so it does not need further processing. Below we rename the relevant variable and select it along with the cohort member identifiers, `MCSID` and `BCNUM00`. ```{r} df_ethnic_group <- cm %>% - select(MCSID, BCNUM00, ethnic_group = BDC08E00) + select(MCSID, BCNUM00, ethnic_group = BDC08E00) # Retains the listed variables, renaming BDC08E00 as ethnic_group df_ethnic_group ``` - -Family country of residence is stored in a family-level dataset (`mcs2_family_derived`). This also does not need any further processing at this stage. 
Later when we merge this with `df_ethnic_group`, we will perform a 1-to-many merge, so the data will be automatically repeated for cases where there are multiple cohort members in a family. +Family country of residence is stored in a family-level dataset (`mcs2_family_derived`). This also does not need any further processing at this stage. Later, when we merge this data with `df_ethnic_group`, we perform a 1-to-many merge, so the data will be automatically repeated for cases where there are multiple cohort members in a family. ```{r} df_country <- family %>% select(MCSID, country = BACTRY00) ``` -Next, we will create a variable that indicates whether *any* parent reads to the cohort member. We will use data from the `mcs2_parent_cm_interview` dataset, which contains a variable for the parent's reading habit to a given child (`BPOFRE00`). We first create a binary variable that indicates whether the parent reads to the cohort member at least once a week, and then create a summary variable indicating whether any (interviewed) parent reads (`max(parent_reads)`) using `summarise()` with `group_by(MCSID, BCNUM00)` to ensure this is calculated per cohort member. The result is a dataset with one row per cohort member with data on whether any parent reads to them. +Next, we create a variable that indicates whether *any* parent reads to the cohort member; in other words, we create a summary variable using data from individual parents. The `mcs2_parent_cm_interview` dataset contains a variable for the parent's reading habit to a given child (`BPOFRE00`). We first create a binary variable that indicates whether the parent reads to the cohort member at least once a week, and then create a summary variable indicating whether any (interviewed) parent reads (`max(parent_reads)`) by collapsing the data using `summarise()` on a [grouped data frame](https://r4ds.hadley.nz/data-transform.html#groups) (`group_by(MCSID, BCNUM00)`) to ensure this is calculated per cohort member. 
The result is a dataset with one row per cohort member with data on whether any parent reads to them.[^1] +[^1]: Below, for simplicity, we drop any rows with missing values (`drop_na()` step). Proper analyses may opt to use a different rule, which may require merging in other information (e.g., setting the value to missing unless all resident parents have been interviewed and provided a valid response). ```{r} df_reads <- parent_cm %>% select(MCSID, BPNUM00, BCNUM00, BPOFRE00) %>% - mutate(parent_reads = case_when(between(BPOFRE00, 1, 3) ~ 1, + mutate(parent_reads = case_when(between(BPOFRE00, 1, 3) ~ 1, # Create binary variable for reading habit between(BPOFRE00, 4, 6) ~ 0)) %>% - drop_na() %>% - group_by(MCSID, BCNUM00) %>% - summarise(parent_reads = max(parent_reads), - .groups = "drop") + drop_na() %>% # Drops rows with any missing value (ensures we get a value where at least 1 parent gave a valid response). + group_by(MCSID, BCNUM00) %>% # Groups the data so summarise() is performed per cohort member. + summarise(parent_reads = max(parent_reads), # Calculates maximum value per cohort member + .groups = "drop") # Removes the grouping from the resulting dataframe. df_reads ``` -We next create two separate variables for whether the responding parent has a warm relationship with the cohort member (`BPPIAW00`) again using the `mcs2_parent_cm_interview` dataset. As the data have one row per parent-cohort member combination, we need to create a variable indicating which parent is which, and then reshape the warmth variable from long to wide (one row per cohort member) using `pivot_wider()`. Again, result is a dataset with one row per cohort member with data on their relationship with each carer. +We next show a different way of using `mcs2_parent_cm_interview`, reshaping the data from long to wide so that we have one row per cohort member. 
As an example, we create separate variables for whether the responding parent has a warm relationship with the cohort member (`BPPIAW00`), one using responses from the main carer and one from the secondary carer. As `mcs2_parent_cm_interview` has one row per parent x cohort member combination, we first create a variable indicating which parent is which (using information from `BELIG00`), and then reshape the warmth variable from [long to wide using `pivot_wider()`](https://r4ds.hadley.nz/data-tidy.html#widening-data). ```{r} df_warm <- parent_cm %>% select(MCSID, BCNUM00, BELIG00, BPPIAW00) %>% - mutate(variable = ifelse(BELIG00 == 1, "main_warm", "secondary_warm"), - value = case_when(BPPIAW00 == 5 ~ 1, + mutate(var_name = ifelse(BELIG00 == 1, "main_warm", "secondary_warm"), + warmth = case_when(BPPIAW00 == 5 ~ 1, between(BPPIAW00, 1, 6) ~ 0)) %>% - select(MCSID, BCNUM00, variable, value) %>% - pivot_wider(names_from = variable, values_from = value) + select(MCSID, BCNUM00, var_name, warmth) %>% + pivot_wider(names_from = var_name, # The new variables are named main_warm and secondary_warm... + values_from = warmth) # ... and take values from "warmth" ``` -Next, we want to create a variable for family social class (NS-SEC) using the `mcs2_parent_derived` dataset. This is a parent level dataset, and we will use the parent's NS-SEC (`BDD05S00`) and take the minimum value for each family (lower values of `BDD05S00` indicate higher social class). +Next, we show an example of creating family-level data using data from individual parents; in this case, a variable for family social class (NS-SEC) using data from `mcs2_parent_derived`. As `mcs2_parent_derived` is a parent-level dataset, we take the minimum of parents' NS-SEC (`BDD05S00`) within a family (lower values of `BDD05S00` indicate higher social class). 
```{r} df_nssec <- parent %>% select(MCSID, BPNUM00, parent_nssec = BDD05S00) %>% - mutate(parent_nssec = if_else(parent_nssec < 0, NA, parent_nssec)) %>% + mutate(parent_nssec = if_else(parent_nssec < 0, NA, parent_nssec)) %>% # Negative values denote various forms of missingness. drop_na() %>% group_by(MCSID) %>% summarise(family_nssec = min(parent_nssec)) ``` -We will also create a variable for the mother's highest education level using the `mcs2_parent_derived` dataset. We will filter for mothers only (`BHCREL00 == 7` [Natural Parent] and `BHPSEX00 == 2` [Female]) and select the variable highest education level (`BDDNVQ00`). We will then merge these two variables with the other variables we have created so far. We use `right_join()`, which gives a row for every mother in the dataset, regardless of whether they have education data (`right_join()` fills variables with `NA` where not observed). +Finally, we create a variable for the mother's highest education level using the `mcs2_parent_derived` dataset. This involves merging in relationship information from the household grid and subsetting the rows so we are left with data for mothers only (see [*Working with the Household Grid*](https://cls-data.github.io/docs/mcs-household_grid.html)). We separately filter the household grid for mothers only (`BHCREL00 == 7` \[Natural Parent\] and `BHPSEX00 == 2` \[Female\]) and select the highest education level variable (`BDDNVQ00`) from the `mcs2_parent_derived` dataset. 
We then merge the datasets together using `right_join()`, which, in this case, gives a row for every mother in the dataset, regardless of whether they have education data or not (`right_join()` fills variables with `NA` where [the retained row does not have a match](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit)).[^2] + +[^2]: More detail on merging with `right_join()` (and other `*_join()` variants) is provided in [*Combining Data Across Sweeps*](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html), as well as [Chapter 19 of the R for Data Science textbook](https://r4ds.hadley.nz/joins.html#sec-mutating-joins). ```{r} df_mother <- hhgrid %>% @@ -105,14 +111,17 @@ df_mother <- hhgrid %>% filter(n == 1) %>% select(MCSID, BPNUM00) -df_mother_edu <- parent %>% - select(MCSID, BPNUM00, mother_nvq = BDDNVQ00) %>% +df_parent_edu <- parent %>% + select(MCSID, BPNUM00, mother_nvq = BDDNVQ00) + +df_mother_edu <- df_parent_edu %>% right_join(df_mother, by = c("MCSID", "BPNUM00")) %>% select(-BPNUM00) ``` # Merging the Datasets -Now we have cleaned each variable, we can merge them together. The cleaned datasets are either at the family level (`df_country`, `df_nssec`, `df_mother_edu`) or cohort member level (`df_ethnic_group`, `df_reads`, `df_warm`). We begin with `df_ethnic_group` as this has all the cohort members (participating at Sweep 2) in it, and then use `left_join()` so these rows are kept (and no more are added). To merge with a family-level dataset, we use `left_join(..., by = "MCSID")` as `MCSID` is the unique identifier for each cohort member. For the cohort member level datasets, we use `left_join(..., by = c("MCSID", "BCNUM00"))` as the combination of `MCSID` and `BCNUM00` uniquely identifies cohort members. + +Now we have cleaned each variable, we can merge them together. The cleaned datasets are either at the family level (`df_country`, `df_nssec`, `df_mother_edu`) or cohort member level (`df_ethnic_group`, `df_reads`, `df_warm`). 
We begin with `df_ethnic_group` as this has all the cohort members participating at Sweep 2 in it. We then use `left_join()` to merge in other data so the original rows are kept (and no more are added). To merge with a family-level dataset, we use `left_join(..., by = "MCSID")` as `MCSID` is the unique identifier for each family. For the cohort member level datasets, we use `left_join(..., by = c("MCSID", "BCNUM00"))` as the combination of `MCSID` and `BCNUM00` uniquely identifies cohort members. ```{r} df_ethnic_group %>% @@ -121,4 +130,6 @@ df_ethnic_group %>% left_join(df_warm, by = c("MCSID", "BCNUM00")) %>% left_join(df_nssec, by = "MCSID") %>% left_join(df_mother_edu, by = "MCSID") -``` \ No newline at end of file +``` + +# Footnotes \ No newline at end of file