- HTML: https://www.opencasestudies.org/ocs-bp-air-pollution/
- GitHub: https://github.com/opencasestudies/ocs-bp-air-pollution/
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies
The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, does not claim to present the most appropriate way to analyze a given dataset, and should not be used to make policy decisions without external consultation from scientific experts.
This case study is part of the OpenCaseStudies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
To cite this case study please use:
Wright, Carrie; Jager, Leah; Taub, Margaret; and Hicks, Stephanie. (2020). Predicting Annual Air Pollution (Version v1.0.0). https://github.com/opencasestudies/ocs-bp-air-pollution
We would like to acknowledge Roger Peng, Megan Latshaw, and Kirsten Koehler for assisting in framing the major direction of the case study.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.
Predicting Annual Air Pollution
Machine learning methods have been used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems.
We will use machine learning methods to predict annual air pollution levels spatially within the US based on data about population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data.
- Can we predict annual average air pollution concentrations at the granularity of zip code regions using predictors such as population density, urbanization, road density, satellite pollution data, and chemical modeling data?
The data that we will use in this case study come from a gravimetric air pollution monitoring system operated by the US Environmental Protection Agency (EPA) that measures fine particulate matter (PM2.5) in the United States (US). We will use data from 876 gravimetric monitors in the contiguous US in 2008.
Roughly 90% of these monitors are located within cities.
Hence, there is an equity issue in terms of capturing the air pollution levels of more rural areas. To get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be useful to estimate or predict air pollution levels in areas with little to no monitoring.
We will use data related to population density, urbanization, and road density, as well as NASA satellite pollution data and chemical modeling data, to predict the monitoring values captured from this air pollution monitoring system.
The data for these 48 predictors come from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).
All of our data was previously collected by a researcher at the Johns Hopkins School of Public Health who studies air pollution and climate change.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data Science Learning Objectives:
- Familiarity with the tidymodels ecosystem
- Ability to evaluate correlation among predictor variables (`corrplot` and `GGally`)
- Ability to implement tidymodels packages such as `rsample` to split the data into training and testing sets, as well as cross validation sets
- Ability to use the `recipes`, `parsnip`, and `workflows` packages to train and test a linear regression model and a random forest model
- Demonstrate how to visualize geo-spatial data using `ggplot2`
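The pipeline these objectives describe (split with `rsample`, pre-process with `recipes`, specify a model with `parsnip`, bundle with `workflows`) can be sketched end to end. The snippet below uses the built-in `mtcars` data as a stand-in, since the case study's air pollution data is loaded later; the formula and variables are illustrative only.

```r
# Minimal tidymodels pipeline sketch, using mtcars as a stand-in dataset.
library(tidymodels)

set.seed(123)
split <- initial_split(mtcars, prop = 0.75)   # rsample: train/test split
train <- training(split)
test  <- testing(split)

rec <- recipe(mpg ~ ., data = train) |>       # recipes: pre-processing steps
  step_normalize(all_numeric_predictors())

lm_spec <- linear_reg() |>                    # parsnip: model specification
  set_engine("lm")

wf <- workflow() |>                           # workflows: bundle recipe + model
  add_recipe(rec) |>
  add_model(lm_spec)

wf_fit <- fit(wf, data = train)               # train on the training set
preds  <- predict(wf_fit, new_data = test)    # predict on the held-out test set
```

The same workflow object can later be refit with a different `parsnip` model specification without changing the pre-processing code.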
Statistical Learning Objectives:
- Basic understanding of the utility of machine learning for prediction and classification
- Understanding of the need for training and test sets
- Understanding of the utility of cross validation
- Understanding of random forest
- How to interpret root mean squared error (rmse) to assess performance for prediction
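As a concrete illustration of the last objective: RMSE is the square root of the mean squared difference between observed and predicted values, so it is reported in the same units as the outcome. The `yardstick` sketch below uses a tiny made-up set of values, not case-study data.

```r
# RMSE = sqrt(mean((truth - estimate)^2)); here the errors are -1, 1, -1, 1,
# so the squared errors are all 1 and the RMSE is exactly 1.
library(yardstick)
library(tibble)

results <- tibble(
  truth    = c(10, 12, 9, 14),   # hypothetical observed values
  estimate = c(11, 11, 10, 13)   # hypothetical predicted values
)

rmse(results, truth = truth, estimate = estimate)
```

A lower RMSE indicates predictions that are, on average, closer to the observed monitor values.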
This case study focuses on machine learning methods. We demonstrate how to train and test a linear regression model and a random forest model.
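A minimal sketch of specifying a random forest with `parsnip` using the `randomForest` engine; the hyper-parameter values and the `mtcars` stand-in data are illustrative, not the case study's tuned settings.

```r
# Random forest regression sketch with parsnip + the randomForest engine.
library(parsnip)

rf_spec <- rand_forest(mtry = 3, trees = 500) |>  # illustrative hyper-parameters
  set_engine("randomForest") |>
  set_mode("regression")

rf_fit <- fit(rf_spec, mpg ~ ., data = mtcars)
rf_preds <- predict(rf_fit, new_data = mtcars[1:3, ])
```

Swapping `linear_reg()` for `rand_forest()` in the same workflow is the main convenience `parsnip` provides.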
The data is imported from a CSV file using the `readr` package.
This case study does not demonstrate very many data wrangling methods. However, we do cover the `mutate()` and `across()` functions of the `dplyr` package in the Data Wrangling section. In the Data Visualization section, some wrangling was required, including combining data using the `inner_join()` function of the `dplyr` package, using the `separate()` function of the `tidyr` package to make two columns out of one, and using the `str_to_title()` function of the `stringr` package to change the format of some character strings.
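The wrangling functions named above can be combined on a small example; the column names and values below are made up for illustration and are not the case study's map data.

```r
# Sketch of inner_join(), separate(), and mutate(across()) on toy data.
library(dplyr)
library(tidyr)
library(stringr)

monitors <- tibble::tibble(
  id    = c("alabama.autauga", "alabama.baldwin"),  # hypothetical "state.county" ids
  value = c(10.5, 12.3)
)
regions <- tibble::tibble(
  id  = c("alabama.autauga", "alabama.baldwin"),
  pop = c(55200, 208100)
)

combined <- monitors |>
  inner_join(regions, by = "id") |>              # dplyr: keep rows present in both
  separate(id, into = c("state", "county"),      # tidyr: one column into two
           sep = "\\.") |>
  mutate(across(c(state, county), str_to_title)) # dplyr + stringr: "alabama" -> "Alabama"

combined
```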
We demonstrate how to get a summary of a relatively large set of predictors using the `skimr` package, as well as how to evaluate correlation among all variables using the `corrplot` package and, in more detail among specific variables, using the `GGally` package.
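A minimal sketch of that summary-and-correlation step, again using `mtcars` as a stand-in for the predictor set:

```r
# Summarize all variables, then visualize pairwise correlations.
library(skimr)
library(corrplot)
library(GGally)

skim(mtcars)                   # skimr: compact summary of every variable

M <- cor(mtcars)               # correlation matrix of all numeric variables
corrplot(M, method = "circle") # corrplot: overview of all pairwise correlations

# GGally: richer pairwise detail for a chosen subset of variables
ggpairs(mtcars[, c("mpg", "wt", "hp")])
```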
We cover the basics of machine learning: (1) the difference between prediction and classification, (2) the importance of training and testing, (3) the concept of cross validation and tuning, and (4) how random forest works.
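For point 3, the resampling step can be sketched with `rsample` and `tune`; the 5-fold count, formula, and `mtcars` data below are illustrative assumptions, not the case study's setup.

```r
# Cross validation sketch: split the data into folds with rsample, then
# fit and assess a model on each fold with tune::fit_resamples().
library(tidymodels)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)          # 5-fold cross validation

lm_spec <- linear_reg() |> set_engine("lm")

cv_res <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp,                          # illustrative formula
  resamples = folds,
  metrics   = metric_set(rmse)
)

collect_metrics(cv_res)                   # average RMSE across the folds
```

Averaging performance across folds gives a more stable estimate than a single train/test split, which is why cross validation is used for tuning.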
- A review of tidymodels
- A course on tidymodels by Julia Silge
- More examples, explanations, and info about tidymodels development from the developers
- A guide for pre-processing with recipes
- A guide for using GGally to create correlation plots
- A guide for using parsnip to try different algorithms or engines
- A list of recipe functions
- A great blog post about cross validation
- A discussion about evaluating model performance for a deeper explanation about how to evaluate model performance
- RStudio cheatsheets
- An explanation of supervised vs unsupervised machine learning and bias-variance trade-off.
- A thorough explanation of principal component analysis.
- If you have access, this is a great discussion about the difference between independence, orthogonality, and lack of correlation.
- Great video explanation of PCA.
Terms and concepts covered:
Tidyverse
Imputation
Transformation
Discretization
Dummy Variables
One Hot Encoding
Data Type Conversions
Interaction
Normalization
Dimensionality Reduction/Signal Extraction
Row Operations
Near Zero Variance
Parameters and Hyper-parameters
Supervised and Unsupervised Learning
Principal Component Analysis
Linear Combinations
Decision Tree
Random Forest
Packages used in this case study:
Package | Use in this case study |
---|---|
here | to easily load and save data |
readr | to import the CSV file data |
dplyr | to view/arrange/filter/select/compare specific subsets of the data |
skimr | to get an overview of data |
summarytools | to get an overview of data in a different style |
magrittr | to use the %<>% piping operator |
corrplot | to make large correlation plots |
GGally | to make smaller correlation plots |
rsample | to split the data into testing and training sets and to split the training set for cross-validation |
recipes | to pre-process data for modeling in a tidy and reproducible way and to extract pre-processed data (major functions are recipe(), prep(), and various transformation step_*() functions, as well as bake(), which extracts pre-processed training data (this previously required juice()) and applies the recipe preprocessing steps to testing data). See here for more info. |
parsnip | an interface to create models (major functions are fit() and set_engine()) |
yardstick | to evaluate the performance of models |
broom | to get tidy output for our model fit and performance |
ggplot2 | to make visualizations with multiple layers |
dials | to specify hyper-parameter tuning |
tune | to perform cross validation, tune hyper-parameters, and get performance metrics |
workflows | to create modeling workflow to streamline the modeling process |
vip | to create variable importance plots |
randomForest | to perform the random forest analysis |
doParallel | to fit cross validation samples in parallel |
stringr | to manipulate the text in the map data |
tidyr | to separate data within a column into multiple columns |
rnaturalearth | to get the geometry data for the earth to plot the US |
maps | to get map database data about counties to draw them on our US map |
sf | to convert the map data into a data frame |
lwgeom | to use the sf function to convert the map geographical data |
rgeos | to use geometry data |
patchwork | to allow plots to be combined |
There is a `Makefile` in this folder that allows you to type `make` to knit the case study contained in `index.Rmd` to `index.html`; it will also knit `README.Rmd` to a markdown file (`README.md`).
This case study is intended to introduce fundamental topics in Machine Learning and to introduce how to implement model prediction using the tidymodels ecosystem of packages in R.
This case study is intended for those with some familiarity with linear regression and R programming.
Students can predict air pollution monitor values using a different algorithm and provide an explanation for how that algorithm works and why it may be a good choice for modeling this data.