Edit MCS reshape long wide

CLS-Data · Sep 19, 2024 · b8da6be
1 parent 9932cec
commit b8da6be

Showing 8 changed files with 237 additions and 193 deletions.
diff --git a/docs/mcs-data_structures.md b/docs/mcs-data_structures.md
@@ -72,11 +72,11 @@ The parent files have a similar structure to the cohort member-level (`mcs[1-7]_
 | M10005C | 1             | ... |
 | ...     | ...           | ... |
 
-Like `[A-G]CNUM00`, `[A-G]PNUM00` on its own is not a unique identifier. It lists the person number within the family, so it has to be combined with `MCSID` to identify a particular individual. Again, like `[A-G]CNUM00`, `[A-G]PNUM00` has a sweep-specific prefix, but take the same value across sweeps for a given individual (i.e., it is persistent).
+Like `[A-G]CNUM00`, `[A-G]PNUM00` on its own is not a unique identifier. It lists the person number within the family, so it has to be combined with `MCSID` to identify a particular individual. Again, like `[A-G]CNUM00`, `[A-G]PNUM00` has a sweep-specific prefix, but takes the same value across sweeps for a given individual (i.e., it is persistent).
 
 The value of `[A-G]PNUM00` is partly arbitrary. It does not specify a particular relationship to a cohort member. Such relationships are recorded in the household grid files, which we discuss further below. `[A-G]PNUM00` does follow a convention, however. For non-cohort members, `[A-G]PNUM00` is a positive integer between 1 and 99. For cohort members, `[A-G]PNUM00` is equal to `[A-G]CNUM00` multiplied by 100; i.e., for the first cohort member in a family it is 100, and for the second it is 200.[^3] While cohort members have a `[A-G]PNUM00`, non-cohort members (parents or other household members) do not get a `[A-G]CNUM00`.
 
-[^3]: An exception to this is in `mcs6_hhgrid.dta` where for all cohort members `FPNUM00 == -1 [Not applicable]`.
+[^3]: Exceptions to this are `mcs[6-7]_hhgrid.dta` where for all cohort members `[F-G]PNUM00 == -1 [Not applicable]`.
 
 Again, as two variables are required to uniquely identify a parent, you may prefer to create a single, unique identifier variable by concatenating `MCSID` and `[A-G]PNUM00`.
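
A minimal sketch of the concatenation described above, using a toy data frame rather than the real MCS files. The column name `parent_id`, the `_` separator, and all values are assumptions made for illustration; they do not come from the documentation.

```r
library(dplyr)

# Toy parent-level data for Sweep 2 (values invented for illustration only)
parents <- tibble(
  MCSID   = c("M10001N", "M10001N", "M10002P"),
  BPNUM00 = c(1, 2, 1)
)

# Combine the family ID and the within-family person number into one key.
# `parent_id` is a hypothetical name. With data read by haven the column may be
# labelled, so stripping labels first (e.g. haven::zap_labels()) is one option.
parents <- parents %>%
  mutate(parent_id = paste(MCSID, BPNUM00, sep = "_"))

# The combined key should be unique even though BPNUM00 alone is not
stopifnot(!anyDuplicated(parents$parent_id))
```

A separator is used here so that the two components stay readable and cannot run together; whether one is needed in practice depends on how the identifier will be used later.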
 

diff --git a/docs/mcs-merging_across_sweeps.md b/docs/mcs-merging_across_sweeps.md
@@ -380,7 +380,7 @@ variables as these have slightly different names each sweep. Typically
 variable names only differ on the sweep prefix used (`ACHTM00`,
 `BCHTM00`), but in Sweep 5 (age 11y), the name of the height variable
 (`ECHTCMA00`) diverged slightly from this pattern. Below, we also
-include a step to `rename()` the `[B-G]CNUM00` variable to `cnum` to
+include a step to `rename()` the `[B-G]CNUM00` variable to `CNUM00` to
 ensure consistency across sweeps as this will make merging more
 straightforward later.
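
To illustrate why a shared key name helps, here is a small sketch on toy data rather than the real MCS files. The helper name `harmonise_cnum` and all values are hypothetical; only the `rename()` with `matches()` pattern mirrors what the text describes.

```r
library(dplyr)

# Toy stand-ins for two sweeps (values invented for illustration only)
sweep2 <- tibble(
  MCSID    = c("M10001N", "M10002P"),
  BCNUM00  = c(1, 1),
  BCHTCM00 = c(97, 96)
)
sweep3 <- tibble(
  MCSID    = c("M10001N", "M10003Q"),
  CCNUM00  = c(1, 1),
  CCHTCM00 = c(114, 110)
)

# Hypothetical helper: drop the sweep prefix from the cohort member number
# so that every sweep uses the same column name, CNUM00
harmonise_cnum <- function(df) {
  rename(df, CNUM00 = matches("^.CNUM00$"))
}

# With a common key name, one `by` specification works for every sweep
merged <- full_join(
  harmonise_cnum(sweep2),
  harmonise_cnum(sweep3),
  by = c("MCSID", "CNUM00")
)
```

Without the rename, each join would need a sweep-specific mapping such as `by = c("MCSID", "BCNUM00" = "CCNUM00")`, which becomes unwieldy as more sweeps are added.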
 
@@ -394,7 +394,7 @@ load_height_wide <- function(sweep){
 
   glue("{fup}y/mcs{sweep}_cm_interview.dta") %>%
     read_dta(col_select = c("MCSID", matches("^.(CNUM00|CHTCM(A|0)0)"))) %>%
-    rename(cnum = matches("CNUM00"))
+    rename(CNUM00 = matches("CNUM00"))
 }
 ```
 
@@ -407,7 +407,7 @@ load_height_wide(2)
 
 ``` text
 # A tibble: 15,778 × 3
-   MCSID   cnum                                BCHTCM00                  
+   MCSID   CNUM00                              BCHTCM00                  
    <chr>   <dbl+lbl>                           <dbl+lbl>                 
  1 M10001N 1 [1st Cohort Member of the family]  97                       
  2 M10002P 1 [1st Cohort Member of the family]  96                       
@@ -428,7 +428,7 @@ load_height_wide(3)
 
 ``` text
 # A tibble: 15,431 × 3
-   MCSID   cnum                                CCHTCM00 
+   MCSID   CNUM00                              CCHTCM00 
    <chr>   <dbl+lbl>                           <dbl+lbl>
  1 M10001N 1 [1st Cohort Member of the family] 114.     
  2 M10002P 1 [1st Cohort Member of the family] 110.     
@@ -449,15 +449,15 @@ rather verbose:
 
 ```r
 load_height_wide(2) %>%
-  full_join(load_height_wide(3), by = c("MCSID", "cnum")) %>%
-  full_join(load_height_wide(4), by = c("MCSID", "cnum")) %>%
-  full_join(load_height_wide(6), by = c("MCSID", "cnum")) %>%
-  full_join(load_height_wide(7), by = c("MCSID", "cnum"))
+  full_join(load_height_wide(3), by = c("MCSID", "CNUM00")) %>%
+  full_join(load_height_wide(4), by = c("MCSID", "CNUM00")) %>%
+  full_join(load_height_wide(6), by = c("MCSID", "CNUM00")) %>%
+  full_join(load_height_wide(7), by = c("MCSID", "CNUM00"))
 ```
 
 ``` text
 # A tibble: 17,568 × 7
-   MCSID   cnum                    BCHTCM00  CCHTCM00 DCHTCM00 FCHTCM00 GCHTCM00
+   MCSID   CNUM00                  BCHTCM00  CCHTCM00 DCHTCM00 FCHTCM00 GCHTCM00
    <chr>   <dbl+lbl>               <dbl+lbl> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
  1 M10001N 1 [1st Cohort Member o…  97       114.     128.      NA       NA     
  2 M10002P 1 [1st Cohort Member o…  96       110.     123      163.     174.    
@@ -504,12 +504,12 @@ merged in.
 
 ```r
 map(2:7, load_height_wide) %>%
-  reduce(~ full_join(.x, .y, by = c("MCSID", "cnum")))
+  reduce(~ full_join(.x, .y, by = c("MCSID", "CNUM00")))
 ```
 
 ``` text
 # A tibble: 17,614 × 8
-   MCSID   cnum           BCHTCM00  CCHTCM00 DCHTCM00 ECHTCMA0 FCHTCM00 GCHTCM00
+   MCSID   CNUM00         BCHTCM00  CCHTCM00 DCHTCM00 ECHTCMA0 FCHTCM00 GCHTCM00
    <chr>   <dbl+lbl>      <dbl+lbl> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
  1 M10001N 1 [1st Cohort…  97       114.     128.      NA       NA       NA     
  2 M10002P 1 [1st Cohort…  96       110.     123      144.     163.     174.    

diff --git a/docs/mcs-merging_within_sweep.md b/docs/mcs-merging_within_sweep.md
@@ -1,6 +1,6 @@
 ---
 layout: default
-title: Combining Data Within a Sweep
+title: Combining Data Within A Sweep
 nav_order: 5
 parent: MCS
 format: docusaurus-md
@@ -82,7 +82,7 @@ Family country of residence is stored in a family-level dataset
 (`mcs2_family_derived`). This also does not need any further processing
 at this stage. Later, when we merge this data with `df_ethnic_group`,
 we perform a 1-to-many merge, so the data will be automatically repeated
-for cases where there are multiple cohort members in a family.
+for cases where there are multiple cohort members in a family.[^1]
 
 ```r
 df_country <- family %>%
@@ -101,7 +101,7 @@ on a [grouped data
 frame](https://r4ds.hadley.nz/data-transform.html#groups)
 (`group_by(MCSID, BCNUM00)`) to ensure this is calculated per cohort
 member. The result is a dataset with one row per cohort member with data
-on whether any parent reads to them.[^1]
+on whether any parent reads to them.[^2]
 
 ```r
 df_reads <- parent_cm %>%
@@ -185,7 +185,7 @@ highest education level variable (`BDDNVQ00`) from the
 dataset, regardless of whether they have education data or not
 (`right_join()` fills variables with `NA` where [the retained row does
 not have a
-match](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit)).[^2]
+match](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit)).[^3]
 
 ```r
 df_mother <- hhgrid %>%
@@ -249,13 +249,21 @@ df_ethnic_group %>%
 
 # Footnotes
 
-[^1]: Below, for simplicity, we drop any rows with missing values
+[^1]: It is also possible to expand a family level dataset so that it
+    has as many rows as there are cohort-members in the family.
+    `mcs2_family_derived.dta` contains a variable, `BDNOCM00`, with this
+    information that can be used with the `tidyverse` function
+    `uncount(BDNOCM00)` to achieve this. (The dataset
+    `mcs_longitudinal_family_file` contains a variable `NOCMHH` which
+    holds similar information.)
+
+[^2]: Below, for simplicity, we drop any rows with missing values
     (`drop_na()` step). Proper analyses may opt to use a different rule,
     which may require merging in other information (e.g., setting the
     value to missing unless all resident parents have been interviewed
     and provided a valid response).
 
-[^2]: More detail on merging with `right_join()` (and other `*_join()`
+[^3]: More detail on merging with `right_join()` (and other `*_join()`
     variants) is provided in [*Combining Data Across
     Sweeps*](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html),
     as well as [Chapter 19 of the R for Data Science