OpenCaseStudies

Important Links

Disclaimer

The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.

License

This case study is part of the OpenCaseStudies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.

Citation

To cite this case study please use:

Wright, Carrie and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-diet. Exploring global patterns of dietary behaviors associated with health risk (Version v1.0.0).

Acknowledgments

We would like to acknowledge Jessica Fanzo for assisting in framing the major direction of the case study, as well as Ashkan Afshin and Erin Mullany for giving us access to the data.

We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.

Title

Exploring global patterns of dietary behaviors associated with health risk

Motivation

According to this article that evaluated food consumption patterns in 185 countries for 15 dietary risk factors with probable associations with non-communicable disease:

“High intake of sodium …, low intake of whole grains …, and low intake of fruits … were the leading dietary risk factors for deaths and DALYs globally and in many countries.”

In this case study we evaluate the data used in this article to explore regional, age-specific, and gender-specific differences in dietary consumption patterns around the world in 2017. We particularly focus on dietary consumption patterns within the United States (US) and how these compare to those of other countries.

Motivating questions

Our main questions:

  1. What are the global trends for potentially harmful diets?
  2. How do males and females compare?
  3. How do different age groups compare for these dietary factors?
  4. How do different countries compare? In particular, how does the US compare to other countries in terms of diet trends?

Data

In this case study we will be using data that we requested from the Global Burden of Disease (GBD) about consumption of dietary factors associated with health risks.

We will also be using data from a PDF of an article about the optimal consumption guidelines for these dietary factors.

Their methods for identifying and authenticating incidents are outlined here.

Previously, according to their website:

“The database compiles information from more than 25 different sources including peer-reviewed studies, government reports, mainstream media, non-profits, private websites, blogs, and crowd-sourced lists that have been analyzed, filtered, deconflicted, and cross-referenced. All of the information is based on open-source information and 3rd party reporting… and may include reporting errors.”

Learning Objectives

The skills, methods, and concepts that students will be familiar with by the end of this case study are:

Data Science Learning Objectives:

  1. Importing/extracting data from PDF (dplyr, stringr)
  2. How to reshape data by pivoting between “long” and “wide” formats (tidyr)
  3. Perform functions on all columns of a tibble (purrr)
  4. Data cleaning with regular expressions (stringr)
  5. Specific data value reassignment
  6. Separate data within a column into multiple columns (tidyr)
  7. Methods to compare data (dplyr)
  8. Combining data from two sources (dplyr)
  9. Make interactive plots (ggiraph)
  10. Make a zoom facet for plot (ggforce)
  11. Combine plots together (cowplot)

Statistical Learning Objectives:

  1. Understanding of how the t-test and the ANOVA are specialized regressions
  2. Basic understanding of the utility of a regression analysis
  3. How to implement a linear regression analysis in R
  4. How to interpret regression coefficients
  5. Awareness of t-test assumptions
  6. Awareness of linear regression assumptions
  7. How to use Q-Q plots to check for normality
  8. Difference between fixed effects and random effects
  9. How to perform paired t-test
  10. How to perform a linear mixed effects regression

Data import

In this case study we demonstrate how to import data from a csv and from a PDF.

Data wrangling

This case study also covers many of the stringr functions to manipulate character strings, including str_split(), str_subset(), str_replace(), str_replace_all(), str_which(), str_count(), str_remove_all(), and str_trim().
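As a rough illustration of how several of these stringr functions fit together, here is a minimal sketch. The text in `raw_line` and `pdf_lines` is made up for this example and is not taken from the case study's PDF.

```r
library(stringr)

# Hypothetical line of text, standing in for one row of a table read from a PDF
raw_line <- "  Fruits    250 g per day  (200-300)  "

# Remove leading and trailing whitespace
trimmed <- str_trim(raw_line)

# Split into fields wherever two or more spaces occur in a row
fields <- str_split(trimmed, " {2,}")[[1]]

# Strip the parentheses from the range and the unit text from the amount
range_clean <- str_remove_all(fields[3], "[()]")
amount <- str_replace(fields[2], " g per day", "")

# Hypothetical vector of lines to search
pdf_lines <- c("Fruits 250", "Table 1", "Sodium 3")

header_lines <- str_subset(pdf_lines, "^Table")  # elements that match
header_index <- str_which(pdf_lines, "^Table")   # positions of the matches
digit_counts <- str_count(pdf_lines, "[0-9]")    # number of digits per element
```

Note that str_replace() changes only the first match in each string, while str_replace_all() and str_remove_all() act on every match.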

This case study also covers how to use tidyr functions such as pivot_wider() and pivot_longer() for reshaping data and the separate() function for creating new columns from an existing column. In addition, the case study covers how to replace NA values with a specific value using the replace_na() function.
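The reshaping steps described above might look something like the following sketch. The tibble `intake_wide` and its column names and values are invented for illustration only.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: one row per location, one column per dietary factor
intake_wide <- tibble(
  location = c("US_Male", "US_Female"),
  fruit    = c(100, NA),
  sodium   = c(4, 3)
)

# Wide to long: one row per location and dietary factor
intake_long <- pivot_longer(
  intake_wide,
  cols = c(fruit, sodium),
  names_to = "food",
  values_to = "grams"
)

# Replace missing values in the grams column with a specific value
intake_long <- replace_na(intake_long, list(grams = 0))

# Split the location column into two new columns at the underscore
intake_long <- separate(intake_long, location, into = c("country", "sex"), sep = "_")

# Long back to wide: one column per dietary factor again
intake_back <- pivot_wider(intake_long, names_from = food, values_from = grams)
```

Whether replacing NA with 0 is sensible depends on the data; it is used here only to show the mechanics of replace_na().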

This case study also goes over many of the dplyr functions for modifying, selecting, and filtering data, such as rename(), mutate(), arrange(), select(), and filter(). It covers functions for comparing data, such as setequal(), all_equal(), and setdiff(), as well as intersect() for finding overlapping values, and describes the differences between these functions. We also introduce how to recode data using the if_else() and case_when() functions and how to join data using the full_join() function.
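To make the comparison, joining, and recoding functions more concrete, here is a small sketch. The two tibbles, their column names, and all values are invented for this example and are not real guideline or intake figures.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables with made-up values
guidelines <- tibble(food = c("fruit", "sodium", "nuts"), optimal = c(10, 2, 5))
observed <- tibble(food = c("fruit", "sodium", "red meat"), mean_intake = c(4, 6, 30))

# Compare which foods appear in each table
same_foods <- setequal(guidelines$food, observed$food)       # are the sets identical?
only_guidelines <- setdiff(guidelines$food, observed$food)   # in the first, not the second
in_both <- intersect(guidelines$food, observed$food)         # present in both

# Keep every food from either table, then recode into a status label
combined <- full_join(observed, guidelines, by = "food") %>%
  mutate(status = case_when(
    is.na(mean_intake) | is.na(optimal) ~ "missing",
    mean_intake < optimal ~ "below optimal",
    TRUE ~ "at or above optimal"
  ))
```

full_join() keeps unmatched rows from both tables and fills the gaps with NA, which is why the first case_when() condition checks for missing values before any comparison is made.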

We also cover how to use the purrr package map() function to apply the same function to multiple columns in a tibble.
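A minimal sketch of applying one function to every column with map() is shown below. The tibble `toy` is hypothetical, and wrapping the result in as_tibble() is one possible way to get a tibble back, not necessarily the approach used in the case study.

```r
library(purrr)
library(tibble)

# Hypothetical tibble whose numeric values were read in as character strings
toy <- tibble(mean = c("1.5", "2.0"), lower = c("1.0", "1.5"))

# map() applies as.numeric() to each column and returns a named list,
# which as_tibble() turns back into a tibble
converted <- as_tibble(map(toy, as.numeric))

# map_dbl() also works column by column but returns a named numeric vector
col_means <- map_dbl(converted, mean)
```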

Data Visualization

In this case study we show how to make faceted plots, as well as plots with a facet that is zoomed in using the facet_zoom() function of the ggforce package. We cover how to highlight specific data points, as well as how to add annotations and horizontal lines to make the plot more interpretable.

We also demonstrate how to make interactive plots where the data points link you to other websites using the ggiraph package. Finally, we demonstrate how to combine plots using the cowplot package.

We also cover how to use the viridis package to make plots that are more interpretable for those who are colorblind.

Analysis

This case study has a particularly thorough analysis section that examines the data using progressively more complex approaches. We describe how the t-test and the ANOVA are actually specialized forms of regression analysis.
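The equivalence between the two-sample t-test, one-way ANOVA, and regression can be checked numerically. The sketch below uses simulated data (not the GBD data) and only base R. It assumes equal variances, since the equivalence holds for the pooled-variance t-test rather than Welch's t-test.

```r
set.seed(1)

# Simulated toy data: two groups of 20 observations each
toy <- data.frame(
  sex    = rep(c("Female", "Male"), each = 20),
  intake = c(rnorm(20, mean = 50, sd = 10), rnorm(20, mean = 55, sd = 10))
)

# Two-sample t-test with pooled variance
tt <- t.test(intake ~ sex, data = toy, var.equal = TRUE)

# The same comparison as a linear regression with a single binary predictor
fit <- lm(intake ~ sex, data = toy)
coefs <- summary(fit)$coefficients
t_lm <- coefs["sexMale", "t value"]
p_lm <- coefs["sexMale", "Pr(>|t|)"]

# With two groups, the one-way ANOVA F statistic is the square of the t statistic
f_aov <- anova(fit)["sex", "F value"]

# The t statistics differ only in sign (Female minus Male vs. Male minus Female)
same_t <- isTRUE(all.equal(abs(unname(tt$statistic)), abs(t_lm)))
same_p <- isTRUE(all.equal(tt$p.value, p_lm))
same_f <- isTRUE(all.equal(f_aov, t_lm^2))
```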

We provide an introduction to regression analysis.

We also describe paired data and how to interpret this using both a paired t-test and a linear model with fixed effects or a linear model with mixed effects. We also describe the difference between random and fixed effects.
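The link between the paired t-test and a linear model with fixed effects can be sketched the same way. The data below are simulated, and pairing male and female values within a hypothetical `country` factor is an assumption made for this example.

```r
set.seed(2)
n <- 15

# Simulated paired data: each hypothetical country contributes one Female
# and one Male value that share a country-level effect
country_effect <- rnorm(n, mean = 0, sd = 8)
toy <- data.frame(
  country = factor(rep(paste0("country_", 1:n), times = 2)),
  sex     = rep(c("Female", "Male"), each = n),
  intake  = c(50 + country_effect + rnorm(n, sd = 2),
              53 + country_effect + rnorm(n, sd = 2))
)

female <- toy$intake[toy$sex == "Female"]
male   <- toy$intake[toy$sex == "Male"]

# Paired t-test on the within-country differences (Male minus Female)
paired <- t.test(male, female, paired = TRUE)

# Linear model with a fixed effect for each country
fit <- lm(intake ~ sex + country, data = toy)
coefs <- summary(fit)$coefficients

same_t <- isTRUE(all.equal(unname(paired$statistic), coefs["sexMale", "t value"]))
same_p <- isTRUE(all.equal(paired$p.value, coefs["sexMale", "Pr(>|t|)"]))
```

A mixed effects version would instead treat country as a random intercept, for example with a formula such as intake ~ sex + (1 | country) passed to lmer() from the lmerTest package.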

See this other case study for more introductory material about comparing groups, hypothesis testing, probability, distributions, normality, paired data, and the paired t-test.

Other notes and resources

RStudio
Cheatsheet on RStudio IDE
Other RStudio cheatsheets
RStudio projects

Tidyverse

Piping in R

String manipulation cheatsheet
Table formats

Helpful Links

Terms and concepts covered:

Interpunct
Regular expressions
Inference
Regression
Different types of regression
Ordinary least squares method
Residual
t-tests
ANOVA
t-tests and ANOVA are equivalent to regression; also see here and here about how many commonly known statistical tests are specialized forms of regression
Normal distribution
Q-Q plot
Guide to residual diagnostic plots and Examples
Residual vs fitted plot
Scale-location plot
Homoscedasticity
Heteroscedasticity
Interpreting lm() output
Coefficients
Linear mixed effects regression
Satterthwaite formula
Mood’s Two-Sample Scale Test
Standard deviation
Homogeneity of Variances assumption
polyunsaturated fatty acids

Tests of Homogeneity of Variance for 3 or more groups:

Bartlett’s test
Fligner-Killeen
Levene’s test

Other helpful links:

Long and Wide Data Formats
Distributions
Skewed Distributions
Bimodal Distribution
ggplot2
Shapiro-Wilk Test
Paired Data
Welch’s t-test
Parametric and Nonparametric Methods
Variance
Balanced Study Design
Independent Observations
Transformation
Permutation/Resampling Methods
Central Limit Theorem
Wilcoxon Signed Rank Test
Wilcoxon Rank Sum Test
Two-sample Kolmogorov-Smirnov Test
Type 1 Error
p-value
Multiple Testing
Bonferroni Method of Multiple Testing Correction

Packages used in this case study:

| Package | Use in this case study |
|---|---|
| here | to easily load and save data |
| readr | to import the CSV file data |
| dplyr | to arrange/filter/select/compare specific subsets of the data |
| skimr | to get an overview of data |
| pdftools | to read a PDF into R |
| stringr | to manipulate the text within the PDF of the data |
| magrittr | to use the %<>% piping operator |
| purrr | to perform functions on all columns of a tibble |
| tibble | to create data objects that we can manipulate with dplyr/stringr/tidyr/purrr |
| tidyr | to separate data within a column into multiple columns |
| ggplot2 | to make visualizations with multiple layers |
| ggpubr | to easily add regression line equations to plots |
| forcats | to change details about factors (categorical variables) |
| lmerTest | to perform linear mixed model testing |
| car | to perform Levene’s Test of Homogeneity of Variances |
| ggiraph | to make plots interactive |
| ggforce | to modify facets in plots |
| viridis | to plot with colorblind-friendly color palettes |
| cowplot | to allow plots to be combined |

For users

There is a Makefile in this folder. Typing make knits the case study contained in index.Rmd to index.html and also knits README.Rmd to a markdown file (README.md).

Users can skip the Data Import and Data Wrangling sections to start with the Data Analysis and Visualization section if they wish.

For instructors

Instructors can skip the Data Import and Data Wrangling sections and start with either the Data Exploration, Data Analysis, or Data Visualization sections if they wish.

Target audience

This case study is appropriate for those new to R programming. It is also appropriate for more advanced R users who are new to the Tidyverse. This particular case study may require some introductory knowledge of R programming, particularly for creating visualizations.

Suggested homework

Students can evaluate consumption estimates of another dietary factor besides red meat.