This repository contains the code to process raw data from the CD2 vocabularies survey for each participating cohort study into a much more readable and useful format. It also contains the results, but for privacy reasons (email addresses, IP addresses and demographic information), the raw survey data is not published here.
Below you will find what this survey is about and how the data are processed.
You can read the script in HTML form here. The script itself is written in R Markdown (Rmd) and can be found here.
Table of contents
- About the CD2 project
- About the CD2 vocabularies survey
- Structure of the survey
- Structure of the datafile
- Functionality of this code
- Dependencies
- License
- Contributing and contact
Connecting Data in Child Development (CD2) is an infrastructure project funded by the Platform Digitale Infrastructuur - Social Sciences and Humanities (PDI-SSH). The project aims to harmonize the metadata from 6 Dutch developmental child cohort studies within the Consortium on Individual Development (CID). The end result of this harmonization is an online portal where one can find, among other things, which data were collected in these cohort studies and how to get access to them (note that the portal does not contain the actual data). Additionally, the metadata underlying this web portal should be findable not only through the web portal itself, but also through existing infrastructures such as ODISSEI and HEALTH-RI. You can read the full project description here.
As part of making the CID metadata findable, the individual measures are labelled with keywords and categories. Because there is no fully suitable controlled vocabulary available that fits the wealth of data in CID, existing vocabularies are complemented with our own. To create such a vocabulary, input from the entire CID community was needed to help us determine the relevant keywords and categories that researchers would use to search for all the different types of data within CID. The CD2 vocabularies survey therefore asked respondents (researchers) to 1) provide keywords and 2) choose relevant categories for a subset of experiments within their own cohort.
The full survey can be found in `CD2_vocabularies_survey_questions.pdf` in the `assets` folder of this repository. Importantly, depending on their cohort, respondents answered 2 questions which were repeated for 25 measures of their cohort (e.g., experiments or questionnaires) using Qualtrics's Loop and Merge functionality:
- Which keywords would you assign to this measure? Please separate your keywords with a comma (open text question)
- Choose one or multiple categories that you think fit best and rank them according to their relevance using numbers (1 = most relevant, 2 = second most relevant, etc.). (ranking question with 3 additional text fields in which custom categories could be provided)
The data file resulting from the Qualtrics survey is extremely wide: each of the 6 cohorts had roughly 80 to 200 measures (depending on the cohort) that could be shown. Although each respondent saw only a random selection of 25 of these, the resulting datafile contains columns for all of them and is more than 13,000 columns wide. Not exactly readable!
The datafile contains 3 main types of variables:
- Keywords question: `[instrument_number]_[cohort]_Keywords`: Keywords string response separated by commas.
- Category rankings: `[instrument_number]_[cohort]_Cat_[category-number]`: The response is a number indicating the priority given to the category.
- Category custom categories: `[instrument_number]_[cohort]_Cat_1[2/3/4]_TEXT`: A string indicating the custom category that was provided.
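To make the naming scheme concrete, here is a minimal base R sketch that sorts column names into the three variable types. It is an illustration, not the code from this repository: the cohort labels `cohortA` and `cohortB`, the example column names, and the helper `classify_column` are all hypothetical, and the pattern assumes that cohort labels contain no underscores.

```r
# Hypothetical column names following the three patterns described above.
# "cohortA"/"cohortB" are placeholder cohort labels, not the real ones.
cols <- c("12_cohortA_Keywords", "12_cohortA_Cat_3", "12_cohortA_Cat_12_TEXT",
          "7_cohortB_Keywords", "ResponseId")

# Classify one column name (hypothetical helper, for illustration only).
classify_column <- function(col) {
  # Assumes cohort labels contain no underscore.
  pattern <- "^([0-9]+)_([^_]+)_(Keywords|Cat_[0-9]+|Cat_[0-9]+_TEXT)$"
  m <- regmatches(col, regexec(pattern, col))[[1]]
  if (length(m) == 0) {
    # Not one of the three survey variable types (e.g., survey metadata).
    return(data.frame(column = col, instrument_number = NA_integer_,
                      cohort = NA_character_, type = "other",
                      stringsAsFactors = FALSE))
  }
  suffix <- m[4]
  type <- if (suffix == "Keywords") {
    "keywords"
  } else if (endsWith(suffix, "_TEXT")) {
    "custom_category"
  } else {
    "category_rank"
  }
  data.frame(column = col, instrument_number = as.integer(m[2]),
             cohort = m[3], type = type, stringsAsFactors = FALSE)
}

# One row per column name, with the parsed instrument number, cohort and type.
parsed <- do.call(rbind, lapply(cols, classify_column))
print(parsed)
```

Splitting the names this way is what makes it possible to select, per cohort, only the Keywords or only the Category columns out of the thousands in the raw file.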
The code can be found in the `src` folder and does the following:
- Set the parameters (e.g., filename, additional instrument number files, cohort names).
- Read in the data.
- Create mappings: lists of numbers and instrument names.
- For each cohort, put the data from the Keywords question in a flat, usable format.
- For each cohort, put the data from the Category ranking question in a flat, usable format.
- For each cohort, combine the processed data from the Keywords and Category ranking questions into one processed datafile. These can be found in `data/processed`.
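As a rough illustration of what putting the Keywords question "in a flat, usable format" can mean, the sketch below reshapes a tiny made-up wide table into one row per respondent, instrument and keyword. It uses base R only and is not the repository's implementation (which lives in the Rmd file). The toy data, the `ResponseId` column, the `instruments` mapping and the helper `flatten_keywords` are assumptions made for this example.

```r
# Toy stand-in for the wide Qualtrics export: one row per respondent,
# one Keywords column per measure. All names and values are made up.
raw <- data.frame(
  ResponseId = c("R1", "R2"),
  `1_cohortA_Keywords` = c("attention, eye tracking", NA),
  `2_cohortA_Keywords` = c(NA, "parenting,  stress , "),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical mapping from instrument number to instrument name.
instruments <- c("1" = "Example task", "2" = "Example questionnaire")

# Hypothetical helper: wide Keywords columns -> one row per keyword.
# Assumes the cohort label contains no regex metacharacters.
flatten_keywords <- function(data, cohort) {
  pattern <- paste0("^([0-9]+)_", cohort, "_Keywords$")
  keyword_cols <- grep(pattern, names(data), value = TRUE)
  rows <- list()
  for (col in keyword_cols) {
    number <- sub(pattern, "\\1", col)
    for (i in seq_len(nrow(data))) {
      answer <- data[[col]][i]
      if (is.na(answer)) next  # no answer recorded for this measure
      # Split on commas, trim whitespace, drop empty entries.
      keywords <- trimws(strsplit(answer, ",", fixed = TRUE)[[1]])
      keywords <- keywords[nzchar(keywords)]
      if (length(keywords) == 0) next
      rows[[length(rows) + 1]] <- data.frame(
        respondent = data$ResponseId[i],
        instrument = unname(instruments[number]),
        keyword = keywords,
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

long <- flatten_keywords(raw, "cohortA")
print(long)
```

The Category ranking question can be flattened along the same lines, with the rank stored as an extra column instead of splitting a string.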
The code is in the `CD2-vocabularies-survey-v1.3.Rmd` file in the `src` folder. You need the following to run the code:
- The raw data (not included in this repository)
- The instrument number files as used during the survey (located in `assets/instrumentnrs`)
- R, RStudio and git
- The R packages `rmarkdown`, `data.table`, `tidyverse`, `wordcloud2`, `webshot`, and `htmlwidgets`
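One possible way to check whether the listed R packages are available before knitting the script is sketched below. This snippet is not part of the repository, and the variable names are chosen for this example only.

```r
# Packages listed in the dependencies above.
required <- c("rmarkdown", "data.table", "tidyverse",
              "wordcloud2", "webshot", "htmlwidgets")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!is_installed]

if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)  # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```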
All relevant files are read in by the code. Because the code is R Markdown, you can find a lot of explanation about how the code works in there as well.
Feel free to reuse this code by cloning the repository. Warning: you will most likely have to adapt the code substantially, since it is currently tailored to this specific use case.
This project is licensed under the terms of the MIT License.
This repository is not actively maintained at the moment. However, if you see a bug, feel free to open an issue or a pull request in this repository. Alternatively, feel free to email me for comments or questions.