- HTML: https://www.opencasestudies.org/ocs-bp-air-pollution/
- GitHub: https://github.com/opencasestudies/ocs-bp-air-pollution/
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies
The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, does not claim to present the most appropriate way to analyze a given dataset, and should not be used to make policy decisions without external consultation from scientific experts.
This case study is part of the OpenCaseStudies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
To cite this case study please use:
Wright, Carrie; Jager, Leah; Taub, Margaret; and Hicks, Stephanie. (2020). Predicting Annual Air Pollution (Version v1.0.0). https://github.com/opencasestudies/ocs-bp-air-pollution
We would like to acknowledge Roger Peng, Megan Latshaw, and Kirsten Koehler for assisting in framing the major direction of the case study.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.
Predicting Annual Air Pollution
Machine learning methods have been used to predict air pollution levels when traditional monitoring systems are not available in a particular area or when there is not enough spatial granularity with current monitoring systems.
We will use machine learning methods to predict annual air pollution levels spatially within the US based on data about population density, urbanization, and road density, as well as satellite pollution data and chemical modeling data.
- Can we predict annual average air pollution concentrations at the granularity of zip code regions using predictors such as population density, urbanization, road density, satellite pollution data, and chemical modeling data?
The data that we will use in this case study come from a gravimetric air pollution monitoring system operated by the US Environmental Protection Agency (EPA) that measures fine particulate matter (PM2.5) in the United States (US). We will use data from 876 gravimetric monitors in the contiguous US in 2008.
Roughly 90% of these monitors are located within cities.
Hence, there is an equity issue in terms of capturing the air pollution levels of more rural areas. To get a better sense of the pollution exposures for the individuals living in these areas, methods like machine learning can be useful to estimate or predict air pollution levels in areas with little to no monitoring.
We will use data related to population density, urbanization, and road density, as well as NASA satellite pollution data and chemical modeling data, to predict the monitoring values captured from this air pollution monitoring system.
The data for these 48 predictors come from the US Environmental Protection Agency (EPA), the National Aeronautics and Space Administration (NASA), the US Census, and the National Center for Health Statistics (NCHS).
All of our data was previously collected by a researcher at the Johns Hopkins School of Public Health who studies air pollution and climate change.
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data Science Learning Objectives:
- Familiarity with the tidymodels ecosystem
- Ability to evaluate correlation among predictor variables (`corrplot` and `GGally`)
- Ability to implement tidymodels packages such as `rsample` to split the data into training and testing sets, as well as cross validation sets
- Ability to use the `recipes`, `parsnip`, and `workflows` packages to train and test a linear regression model and a random forest model
- Demonstrate how to visualize geo-spatial data using `ggplot2`
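The pipeline these objectives describe (split with `rsample`, pre-process with `recipes`, specify a model with `parsnip`, bundle with `workflows`) can be sketched end to end. The snippet below uses the built-in `mtcars` data as a stand-in, since the case study's air pollution data is loaded later; the formula and variables are illustrative only.

```r
# Minimal tidymodels pipeline sketch, using mtcars as a stand-in dataset.
library(tidymodels)

set.seed(123)
split <- initial_split(mtcars, prop = 0.75)   # rsample: train/test split
train <- training(split)
test  <- testing(split)

rec <- recipe(mpg ~ ., data = train) |>       # recipes: pre-processing steps
  step_normalize(all_numeric_predictors())

lm_spec <- linear_reg() |>                    # parsnip: model specification
  set_engine("lm")

wf <- workflow() |>                           # workflows: bundle recipe + model
  add_recipe(rec) |>
  add_model(lm_spec)

wf_fit <- fit(wf, data = train)               # train on the training set
preds  <- predict(wf_fit, new_data = test)    # predict on the held-out test set
```

The same workflow object can later be refit with a different `parsnip` model specification without changing the pre-processing code.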
Statistical Learning Objectives:
- Basic understanding of the utility of machine learning for prediction and classification
- Understanding of the need for training and test sets
- Understanding of the utility of cross validation
- Understanding of random forest
- How to interpret root mean squared error (rmse) to assess performance for prediction
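As a concrete illustration of the last objective: RMSE is the square root of the mean squared difference between observed and predicted values, so it is reported in the same units as the outcome. The `yardstick` sketch below uses a tiny made-up set of values, not case-study data.

```r
# RMSE = sqrt(mean((truth - estimate)^2)); here the errors are -1, 1, -1, 1,
# so the squared errors are all 1 and the RMSE is exactly 1.
library(yardstick)
library(tibble)

results <- tibble(
  truth    = c(10, 12, 9, 14),   # hypothetical observed values
  estimate = c(11, 11, 10, 13)   # hypothetical predicted values
)

rmse(results, truth = truth, estimate = estimate)
```

A lower RMSE indicates predictions that are, on average, closer to the observed monitor values.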
This case study focuses on machine learning methods. We demonstrate how to train and test a linear regression model and a random forest model.
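A minimal sketch of specifying a random forest with `parsnip` using the `randomForest` engine; the hyper-parameter values and the `mtcars` stand-in data are illustrative, not the case study's tuned settings.

```r
# Random forest regression sketch with parsnip + the randomForest engine.
library(parsnip)

rf_spec <- rand_forest(mtry = 3, trees = 500) |>  # illustrative hyper-parameters
  set_engine("randomForest") |>
  set_mode("regression")

rf_fit <- fit(rf_spec, mpg ~ ., data = mtcars)
rf_preds <- predict(rf_fit, new_data = mtcars[1:3, ])
```

Swapping `linear_reg()` for `rand_forest()` in the same workflow is the main convenience `parsnip` provides.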
The data is imported from a CSV file using the `readr` package.
This case study does not demonstrate very many data wrangling methods. However, we do cover the `mutate()` and `across()` functions of the `dplyr` package in the Data Wrangling section. In the Data Visualization section, some wrangling was required, including combining data using the `inner_join()` function of the `dplyr` package, using the `separate()` function of the `tidyr` package to make two columns out of one, and using the `str_to_title()` function of the `stringr` package to change the format of some character strings.
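The wrangling functions named above can be combined on a small example; the column names and values below are made up for illustration and are not the case study's map data.

```r
# Sketch of inner_join(), separate(), and mutate(across()) on toy data.
library(dplyr)
library(tidyr)
library(stringr)

monitors <- tibble::tibble(
  id    = c("alabama.autauga", "alabama.baldwin"),  # hypothetical "state.county" ids
  value = c(10.5, 12.3)
)
regions <- tibble::tibble(
  id  = c("alabama.autauga", "alabama.baldwin"),
  pop = c(55200, 208100)
)

combined <- monitors |>
  inner_join(regions, by = "id") |>              # dplyr: keep rows present in both
  separate(id, into = c("state", "county"),      # tidyr: one column into two
           sep = "\\.") |>
  mutate(across(c(state, county), str_to_title)) # dplyr + stringr: "alabama" -> "Alabama"

combined
```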
We demonstrate how to get a summary of a relatively large set of predictors using the `skimr` package, as well as how to evaluate correlation among all variables using the `corrplot` package and, in more detail among specific variables, using the `GGally` package.
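A minimal sketch of that summary-and-correlation step, again using `mtcars` as a stand-in for the predictor set:

```r
# Summarize all variables, then visualize pairwise correlations.
library(skimr)
library(corrplot)
library(GGally)

skim(mtcars)                   # skimr: compact summary of every variable

M <- cor(mtcars)               # correlation matrix of all numeric variables
corrplot(M, method = "circle") # corrplot: overview of all pairwise correlations

# GGally: richer pairwise detail for a chosen subset of variables
ggpairs(mtcars[, c("mpg", "wt", "hp")])
```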
We cover the basics of machine learning: (1) the difference between prediction and classification, (2) the importance of training and testing, (3) the concept of cross validation and tuning, and (4) how random forest works.
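For point 3, the resampling step can be sketched with `rsample` and `tune`; the 5-fold count, formula, and `mtcars` data below are illustrative assumptions, not the case study's setup.

```r
# Cross validation sketch: split the data into folds with rsample, then
# fit and assess a model on each fold with tune::fit_resamples().
library(tidymodels)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)          # 5-fold cross validation

lm_spec <- linear_reg() |> set_engine("lm")

cv_res <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp,                          # illustrative formula
  resamples = folds,
  metrics   = metric_set(rmse)
)

collect_metrics(cv_res)                   # average RMSE across the folds
```

Averaging performance across folds gives a more stable estimate than a single train/test split, which is why cross validation is used for tuning.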
- A review of tidymodels
- A course on tidymodels by Julia Silge
- More examples, explanations, and info about tidymodels development from the developers
- A guide for pre-processing with recipes
- A guide for using GGally to create correlation plots
- A guide for using parsnip to try different algorithms or engines
- A list of recipe functions
- A great blog post about cross validation
- A discussion about evaluating model performance for a deeper explanation about how to evaluate model performance
- RStudio cheatsheets
- An explanation of supervised vs unsupervised machine learning and bias-variance trade-off.
- A thorough explanation of principal component analysis.
- If you have access, this is a great discussion about the difference between independence, orthogonality, and lack of correlation.
- Great video explanation of PCA.
Terms and concepts covered:
Tidyverse
Imputation
Transformation
Discretization
Dummy Variables
One Hot Encoding
Data Type Conversions
Interaction
Normalization
Dimensionality Reduction/Signal Extraction
Row Operations
Near Zero Variance
Parameters and Hyper-parameters
Supervised and Unsupervised Learning
Principal Component Analysis
Linear Combinations
Decision Tree
Random Forest
Packages used in this case study:
Package | Use in this case study |
---|---|
here | to easily load and save data |
readr | to import the CSV file data |
dplyr | to view/arrange/filter/select/compare specific subsets of the data |
skimr | to get an overview of data |
summarytools | to get an overview of data in a different style |
magrittr | to use the %<>% piping operator |
corrplot | to make large correlation plots |
GGally | to make smaller correlation plots |
rsample | to split the data into testing and training sets and to split the training set for cross-validation |
recipes | to pre-process data for modeling in a tidy and reproducible way and to extract pre-processed data (major functions are recipe(), prep(), and various transformation step_*() functions, as well as bake(), which extracts pre-processed training data (this previously required juice()) and applies the recipe preprocessing steps to testing data). See here for more info. |
parsnip | an interface to create models (major functions are fit() and set_engine()) |
yardstick | to evaluate the performance of models |
broom | to get tidy output for our model fit and performance |
ggplot2 | to make visualizations with multiple layers |
dials | to specify hyper-parameter tuning |
tune | to perform cross validation, tune hyper-parameters, and get performance metrics |
workflows | to create modeling workflow to streamline the modeling process |
vip | to create variable importance plots |
randomForest | to perform the random forest analysis |
doParallel | to fit cross validation samples in parallel |
stringr | to manipulate the text in the map data |
tidyr | to separate data within a column into multiple columns |
rnaturalearth | to get the geometry data for the earth to plot the US |
maps | to get map database data about counties to draw them on our US map |
sf | to convert the map data into a data frame |
lwgeom | to use the sf function to convert the map geographical data |
rgeos | to use geometry data |
patchwork | to allow plots to be combined |
There is a `Makefile` in this folder that allows you to type `make` to knit the case study contained in `index.Rmd` to `index.html`; it will also knit `README.Rmd` to a markdown file (`README.md`).
This case study is intended to introduce fundamental topics in Machine Learning and to introduce how to implement model prediction using the tidymodels ecosystem of packages in R.
This case study is intended for those with some familiarity with linear regression and R programming.
Students can predict air pollution monitor values using a different algorithm and provide an explanation for how that algorithm works and why it may be a good choice for modeling this data.