Data Utilization and Processing

K Bennett edited this page Dec 13, 2019 · 1 revision

For more information about the data sources used in the app, please refer to the Data section under the Developer's guide.

Before carrying out analysis and the corresponding data visualization, we first transform the raw data files with loaders. There are several reasons for doing this instead of importing the raw data files directly:

  • Common factors or identifiers, e.g. county names, must have the same string representation and the same data type across data sets.
  • Missing values should be either removed or imputed.
  • Not all of the social determinants data are relevant, so some filtering, either by algorithm or by hand, is needed.
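As an illustration of the first point, here is a minimal Python sketch of identifier normalization. The app's loaders are written in R, so this only shows the idea; the function names, the "County" and two-letter state suffix handling, and the use of 5-digit FIPS codes are assumptions, not taken from the loaders themselves.

```python
import re

def normalize_county_name(name: str) -> str:
    """Trim, lowercase, and drop a trailing state suffix and the word 'County'."""
    cleaned = name.strip().lower()
    cleaned = re.sub(r",\s*[a-z]{2}$", "", cleaned)  # e.g. ", ny" (assumed suffix format)
    cleaned = re.sub(r"\s+county$", "", cleaned)     # e.g. " county"
    return re.sub(r"\s+", " ", cleaned)              # collapse repeated spaces

def normalize_fips(code) -> str:
    """Represent a county code as a zero-padded 5-character string (assumed FIPS)."""
    return str(int(code)).zfill(5)
```

With both data sets passed through the same functions, "Albany County, NY" and " albany " map to the same key, and the code 1001 read as a number matches "01001" read as text.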

+ Raw Data Loaders

Loaders are the set of scripts used to convert raw data files, e.g. Underlying Cause of Death, Despair, 2000-2002.txt, into R data structures such as data.frame or tibble. The generated files are then stored under the init folder and loaded during app initialization. One of the most important features of the loaders is that they are semi-automated: to add more data sets for further analysis, it is enough to upload the data files to the corresponding folder and add their paths to the script. For more information, please see the README.md under the /init folder.
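The "add a path to the script" pattern can be sketched as follows. This is a Python illustration rather than the actual R loaders. It assumes the raw export is tab-delimited with a header row, and the registry name `RAW_FILES` and its label are hypothetical. Real exports may contain trailing notes that would need to be dropped first.

```python
import csv

# Hypothetical registry: adding a data set means adding one entry here.
RAW_FILES = {
    "despair_2000_2002": "Underlying Cause of Death, Despair, 2000-2002.txt",
}

def read_tab_delimited(handle):
    """Parse a tab-delimited text stream with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(handle, delimiter="\t")]

def load_all(registry):
    """Load every registered raw file, keyed by its label."""
    tables = {}
    for label, path in registry.items():
        with open(path, newline="", encoding="utf-8") as handle:
            tables[label] = read_tab_delimited(handle)
    return tables
```

Keeping the file list separate from the parsing code is what makes the process semi-automated: the parsing function does not change when a new file is registered.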

+ Missing Data Imputation

Missing values in the CDC Wonder mortality rates occur in counties that have fewer than 10 deaths or no records. In the originally downloaded Death of Despair mortality rate data, organized in three-year blocks, about one-third of counties have either a suppressed or a missing mortality rate.

A Brief Introduction of Method Used

Most of the counties without values have small populations, so manually inserting a small number (e.g. 0.1) for the death count would create a huge mortality rate. To address this, we download data from CDC Wonder for the ICD-10 codes that exclude the cause we are focusing on, and then subtract them from All-Cause of Death; the difference gives the data for the cause we are looking at. For example, for the Death of Despair mortality rates, the share of counties with missing values decreases from one-third to one-thirtieth. The counties that still have missing values have fewer than 10 deaths for All-Cause of Death or have no records.
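The subtraction step can be sketched in Python as below. The function names are hypothetical, the numbers in the usage note are made up for illustration, and the per-100,000 scale is an assumption about how the rates are expressed.

```python
def deaths_by_difference(all_cause_deaths, other_cause_deaths):
    """Deaths for the cause of interest = all-cause deaths minus all other causes."""
    if all_cause_deaths is None or other_cause_deaths is None:
        return None  # still missing: one of the two inputs is suppressed or absent
    return all_cause_deaths - other_cause_deaths

def rate_per_100k(deaths, population):
    """Mortality rate per 100,000 people (assumed scale)."""
    if deaths is None or not population:
        return None
    return deaths / population * 100_000
```

For a made-up county with 120 all-cause deaths, 114 deaths from all other causes, and a population of 15,000, the difference is 6 deaths. That count would be suppressed if downloaded directly because it is under 10, but the subtraction recovers it and gives a rate of about 40 per 100,000.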

We then use imputation methods in R to deal with this remaining one-thirtieth of counties. After discarding the death count and population, we reshape the data.frame from long format to a wide matrix in which rows represent counties and columns represent the mortality rates for each time period (3147 × 6 in size). As a supplement, we also use state-level data (also downloaded from CDC Wonder) in our imputation.
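The long-to-wide reshape can be sketched in Python as follows (the actual work is done in R). The tuple layout `(county, period, rate)` and the period labels other than 2000-2002 are assumptions made for illustration.

```python
def long_to_wide(records, periods):
    """Turn (county, period, rate) records into {county: [rate per period]}.

    Periods with no record for a county are left as None (missing).
    """
    column = {period: index for index, period in enumerate(periods)}
    wide = {}
    for county, period, rate in records:
        row = wide.setdefault(county, [None] * len(periods))
        row[column[period]] = rate
    return wide
```

Each county becomes one row with one slot per three-year block, so the pattern of missing blocks per county can be read directly from the row.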

We separate the counties with missing values into two parts: one contains counties that are missing in at most 4 three-year blocks; the other contains counties that are missing in all three-year blocks or have a value in only one.

  • For the first part, which has less missing data, we use the Amelia package, which lets us set bounds on the imputed values: the lower bound is 0 and the upper bound is the rate corresponding to a death count of 9. We run Amelia twice and use the average of the two results for the missing values. If Amelia fails, we check whether the corresponding state mortality rate exceeds the upper bound. If the state mortality rate is smaller, we use the state mortality rate for the county; otherwise, we generate two random values (the rates corresponding to a death count from 1-4 and from 5-8) and use whichever is closer to the non-missing mortality rate in the county.

  • For the second part, similar methods are used. If the state mortality rate is smaller than the upper bound of the county, we replace the blank with the state mortality rate; otherwise (if it exceeds the upper bound), we generate two random values (the rates corresponding to a death count from 1-4 and from 5-8) and select whichever is closer to the corresponding state mortality rate.
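The split and the fallback rule in the two bullets above can be sketched in Python as below. This is not the R code used in the app, and the Amelia step itself is not shown. The function names, the per-100,000 scale, drawing whole-number death counts, and passing a single `reference_rate` (the county's non-missing rate for the first part, the state rate for the second) are all assumptions.

```python
import random

def rate_per_100k(deaths, population):
    """Mortality rate per 100,000 people (assumed scale)."""
    return deaths / population * 100_000

def split_by_missing(wide, max_missing=4):
    """Split counties that have gaps into the two parts by number of missing blocks."""
    part_one, part_two = {}, {}
    for county, rates in wide.items():
        missing = sum(rate is None for rate in rates)
        if missing == 0:
            continue  # nothing to impute for this county
        (part_one if missing <= max_missing else part_two)[county] = rates
    return part_one, part_two

def fallback_rate(population, state_rate, reference_rate, rng=random):
    """Fallback when no model-based imputation is available.

    Use the state rate if it is below the county's upper bound (the rate for
    9 deaths); otherwise draw one rate for 1-4 deaths and one for 5-8 deaths
    and keep whichever is closer to reference_rate.
    """
    upper_bound = rate_per_100k(9, population)
    if state_rate < upper_bound:
        return state_rate
    low = rate_per_100k(rng.randint(1, 4), population)
    high = rate_per_100k(rng.randint(5, 8), population)
    return min((low, high), key=lambda rate: abs(rate - reference_rate))
```

Both candidate values correspond to fewer than 9 deaths, so the imputed rate never exceeds the county's upper bound.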

Once all missing values have been replaced, we convert the matrix back to a data.frame in long format for later use.

+ Feature Filtering

A mapping file assigns a short title to each feature and marks the features that should not be included. For example, we do not include any features related to mortality rates or features that are raw counts.
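A minimal Python sketch of applying such a mapping is shown below. The mapping layout (original feature name mapped to a short title and an include flag) and the feature names in the usage note are hypothetical; the actual mapping file may be organized differently.

```python
def filter_features(rows, mapping):
    """Keep only features marked for inclusion and rename them to short titles.

    mapping: {original_name: (short_title, include)}
    rows: list of dicts keyed by original feature name
    """
    kept = {name: title for name, (title, include) in mapping.items() if include}
    return [
        {kept[name]: value for name, value in row.items() if name in kept}
        for row in rows
    ]
```

For example, with `{"pct_uninsured_adults": ("Uninsured", True), "num_deaths": ("Deaths", False)}`, each row keeps only an "Uninsured" column and the raw-count feature is dropped.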

+ Cause of Death Filtering

The cause of death is selected based on the ICD-10 code. Currently, the selected causes of death are: Despair (F10-F19, X40-X49, X60-X84, Y10-Y19), Cancer (C00-C97), Cardiovascular Disease (I00-I99), Assault (X85-Y09), and a high-level overview of All-Cause of Death.
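The code ranges above can be expressed as a lookup, sketched here in Python. In the app the selection is made when the data are downloaded from CDC Wonder, so this function is only an illustration of the ranges; the names `CAUSE_RANGES` and `classify_icd10` are hypothetical.

```python
# Three-character ICD-10 category ranges, copied from the list above.
CAUSE_RANGES = {
    "Despair": [("F10", "F19"), ("X40", "X49"), ("X60", "X84"), ("Y10", "Y19")],
    "Cancer": [("C00", "C97")],
    "Cardiovascular Disease": [("I00", "I99")],
    "Assault": [("X85", "Y09")],
}

def classify_icd10(code):
    """Return the selected cause for an ICD-10 code, or None if it is in no range."""
    category = code.strip().upper()[:3]  # "C34.1" -> "C34"
    for cause, ranges in CAUSE_RANGES.items():
        # A letter followed by two digits sorts correctly as plain text.
        if any(low <= category <= high for low, high in ranges):
            return cause
    return None
```

Plain string comparison is enough here because every category is one letter plus two digits, so a range such as X85-Y09 that crosses a letter boundary is still handled correctly.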