Commit

Edit MCS reshape long wide
ljwright committed Sep 19, 2024
1 parent 9932cec commit b8da6be
Showing 8 changed files with 237 additions and 193 deletions.
4 changes: 2 additions & 2 deletions docs/mcs-data_structures.md
Original file line number Diff line number Diff line change
@@ -72,11 +72,11 @@ The parent files have a similar structure to the cohort member-level (`mcs[1-7]_
| M10005C | 1 | ... |
| ... | ... | ... |

-Like `[A-G]CNUM00`, `[A-G]PNUM00` on its own is not a unique identifier. It lists the person number within the family, so it has to be combined with `MCSID` to identify a particular individual. Again, like `[A-G]CNUM00`, `[A-G]PNUM00` has a sweep-specific prefix, but take the same value across sweeps for a given individual (i.e., it is persistent).
+Like `[A-G]CNUM00`, `[A-G]PNUM00` on its own is not a unique identifier. It lists the person number within the family, so it has to be combined with `MCSID` to identify a particular individual. Again, like `[A-G]CNUM00`, `[A-G]PNUM00` has a sweep-specific prefix, but takes the same value across sweeps for a given individual (i.e., it is persistent).

The value of `[A-G]PNUM00` is partly arbitrary. It does not specify a particular relationship to a cohort member. Such relationships are determined in the household grid files, which we discuss further below. The `[A-G]PNUM00` does follow a convention, however. For non-cohort members, `[A-G]PNUM00` is a positive integer between 1 and 99. For cohort members, `[A-G]PNUM00` is equal to `[A-G]CNUM00` multiplied by 100; i.e. for the first cohort member in a family it is 100, and for the second it is 200.[^3] While cohort members have a `[A-G]PNUM00`, non-cohort members (parents or other household members) do not get a `[A-G]CNUM00`.

-[^3]: An exception to this is in `mcs6_hhgrid.dta` where for all cohort members `FPNUM00 == -1 [Not applicable]`.
+[^3]: Exceptions to this are `mcs[6-7]_hhgrid.dta` where for all cohort members `[F-G]PNUM00 == -1 [Not applicable]`.

Again, as two variables are required to uniquely identify a parent, you may prefer to create a single, unique identifier variable by concatenating `MCSID` and `[A-G]PNUM00`.
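
A minimal sketch of one way to build such an identifier in R, assuming the tidyverse is available. The toy data frame, the choice of the sweep 2 variable `BPNUM00`, and the new variable name `pid` are illustrative assumptions, not part of the MCS files or of this commit.

```r
library(dplyr)

# Toy stand-in for a parent-level file (hypothetical values, not real MCS data)
df_parents <- tibble(
  MCSID   = c("M10001N", "M10001N", "M10002P"),
  BPNUM00 = c(1, 2, 1)
)

# Concatenate family ID and person number into one identifier.
# The "_" separator keeps the two parts unambiguous.
df_parents <- df_parents %>%
  mutate(pid = paste(MCSID, BPNUM00, sep = "_"))

# Check that the new identifier is unique across rows
stopifnot(n_distinct(df_parents$pid) == nrow(df_parents))
```

With the real files, `[A-G]PNUM00` is read in as a labelled variable, so it may be necessary to strip the labels first (for example with `haven::zap_labels()`) before pasting.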

22 changes: 11 additions & 11 deletions docs/mcs-merging_across_sweeps.md
@@ -380,7 +380,7 @@ variables as these have slightly different names each sweep. Typically
variable names only differ on the sweep prefix used (`ACHTM00`,
`BCHTM00`), but in Sweep 5 (age 11y), the name of the height variable
(`ECHTCMA00`) diverged slightly from this pattern. Below, we also
-include a step to `rename()` the `[B-G]CNUM00` variable to `cnum` to
+include a step to `rename()` the `[B-G]CNUM00` variable to `CNUM00` to
ensure consistency across sweeps as this will make merging more
straightforward later.

@@ -394,7 +394,7 @@ load_height_wide <- function(sweep){

glue("{fup}y/mcs{sweep}_cm_interview.dta") %>%
read_dta(col_select = c("MCSID", matches("^.(CNUM00|CHTCM(A|0)0)"))) %>%
-rename(cnum = matches("CNUM00"))
+rename(CNUM00 = matches("CNUM00"))
}
```

@@ -407,7 +407,7 @@ load_height_wide(2)

``` text
# A tibble: 15,778 × 3
-MCSID cnum BCHTCM00
+MCSID CNUM00 BCHTCM00
<chr> <dbl+lbl> <dbl+lbl>
1 M10001N 1 [1st Cohort Member of the family] 97
2 M10002P 1 [1st Cohort Member of the family] 96
@@ -428,7 +428,7 @@ load_height_wide(3)

``` text
# A tibble: 15,431 × 3
-MCSID cnum CCHTCM00
+MCSID CNUM00 CCHTCM00
<chr> <dbl+lbl> <dbl+lbl>
1 M10001N 1 [1st Cohort Member of the family] 114.
2 M10002P 1 [1st Cohort Member of the family] 110.
@@ -449,15 +449,15 @@ rather verbose:

```r
load_height_wide(2) %>%
-full_join(load_height_wide(3), by = c("MCSID", "cnum")) %>%
-full_join(load_height_wide(4), by = c("MCSID", "cnum")) %>%
-full_join(load_height_wide(6), by = c("MCSID", "cnum")) %>%
-full_join(load_height_wide(7), by = c("MCSID", "cnum"))
+full_join(load_height_wide(3), by = c("MCSID", "CNUM00")) %>%
+full_join(load_height_wide(4), by = c("MCSID", "CNUM00")) %>%
+full_join(load_height_wide(6), by = c("MCSID", "CNUM00")) %>%
+full_join(load_height_wide(7), by = c("MCSID", "CNUM00"))
```

``` text
# A tibble: 17,568 × 7
-MCSID cnum BCHTCM00 CCHTCM00 DCHTCM00 FCHTCM00 GCHTCM00
+MCSID CNUM00 BCHTCM00 CCHTCM00 DCHTCM00 FCHTCM00 GCHTCM00
<chr> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
1 M10001N 1 [1st Cohort Member o… 97 114. 128. NA NA
2 M10002P 1 [1st Cohort Member o… 96 110. 123 163. 174.
@@ -504,12 +504,12 @@ merged in.

```r
map(2:7, load_height_wide) %>%
-reduce(~ full_join(.x, .y, by = c("MCSID", "cnum")))
+reduce(~ full_join(.x, .y, by = c("MCSID", "CNUM00")))
```

``` text
# A tibble: 17,614 × 8
-MCSID cnum BCHTCM00 CCHTCM00 DCHTCM00 ECHTCMA0 FCHTCM00 GCHTCM00
+MCSID CNUM00 BCHTCM00 CCHTCM00 DCHTCM00 ECHTCMA0 FCHTCM00 GCHTCM00
<chr> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
1 M10001N 1 [1st Cohort… 97 114. 128. NA NA NA
2 M10002P 1 [1st Cohort… 96 110. 123 144. 163. 174.
20 changes: 14 additions & 6 deletions docs/mcs-merging_within_sweep.md
@@ -1,6 +1,6 @@
---
layout: default
-title: Combining Data Within a Sweep
+title: Combining Data Within A Sweep
nav_order: 5
parent: MCS
format: docusaurus-md
@@ -82,7 +82,7 @@ Family country of residence is stored in a family-level dataset
(`mcs2_family_derived`). This also does not need any further processing
at this stage. Later, when we merge this data with `df_ethnic_group`,
we perform a 1-to-many merge, so the data will be automatically repeated
-for cases where there are multiple cohort members in a family.
+for cases where there are multiple cohort members in a family.[^1]

```r
df_country <- family %>%
@@ -101,7 +101,7 @@ on a [grouped data
frame](https://r4ds.hadley.nz/data-transform.html#groups)
(`group_by(MCSID, BCNUM00)`) to ensure this is calculated per cohort
member. The result is a dataset with one row per cohort member with data
-on whether any parent reads to them.[^1]
+on whether any parent reads to them.[^2]

```r
df_reads <- parent_cm %>%
@@ -185,7 +185,7 @@ highest education level variable (`BDDNVQ00`) from the
dataset, regardless of whether they have education data or not
(`right_join()` fills variables with `NA` where [the retained row does
not have a
-match](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit)).[^2]
+match](https://r4ds.hadley.nz/missing-values.html#sec-missing-implicit)).[^3]

```r
df_mother <- hhgrid %>%
@@ -249,13 +249,21 @@ df_ethnic_group %>%

# Footnotes

-[^1]: Below, for simplicity, we drop any rows with missing values
+[^1]: It is also possible to expand a family level dataset so that it
+has as many rows as there are cohort-members in the family.
+`mcs2_family_derived.dta` contains a variable, `BDNOCM00`, with this
+information that can be used with the `tidyverse` function
+`uncount(BDNOCM00)` to achieve this. (The dataset
+`mcs_longitudinal_family_file` contains a variable `NOCMHH` which
+holds similar information.)
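
The `uncount()` approach described in this footnote can be sketched as follows. This is an illustration on toy data only: the values and the `country` column are hypothetical, and `.remove = FALSE` is used here simply to keep the count variable in the result.

```r
library(dplyr)
library(tidyr)

# Toy family-level data: one row per family (hypothetical values)
family_toy <- tibble(
  MCSID    = c("M10001N", "M10002P"),
  BDNOCM00 = c(1, 2),               # number of cohort members in the family
  country  = c("England", "Wales")  # hypothetical family-level variable
)

# Repeat each family row once per cohort member
family_expanded <- family_toy %>%
  uncount(BDNOCM00, .remove = FALSE)

family_expanded
```

The expanded data has one row per cohort member, but it does not yet carry a cohort member number; that would still need to be added or merged in before joining on `MCSID` and `[A-G]CNUM00`.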

+[^2]: Below, for simplicity, we drop any rows with missing values
(`drop_na()` step). Proper analyses may opt to use a different rule,
which may require merging in other information (e.g., setting the
value to missing unless all resident parents have been interviewed
and provided a valid response).

-[^2]: More detail on merging with `right_join()` (and other `*_join()`
+[^3]: More detail on merging with `right_join()` (and other `*_join()`
variants) is provided in [*Combining Data Across
Sweeps*](https://cls-data.github.io/docs/mcs-merging_across_sweeps.html),
as well as [Chapter 19 of the R for Data Science