10 SEP 2023 - This is a re-upload of a repository that I collaborated on. The project was created as part of a 6-month Data Analytics Bootcamp administered by George Washington University. The original repo with all collaborators can be found at https://github.com/danig89/Covid-19_Vaccine_Hesitancy. The purpose of re-uploading it is to preserve the work and provide a way to pin it on my profile, which for some reason I cannot do as a collaborator.
This topic was chosen due to a shared interest in healthcare and public health as it relates to Covid-19. The group believes that AI/ML techniques can help in determining which demographic factors contribute to vaccine hesitancy.
The purpose of this project is to explore which factors are more likely to contribute to an individual’s hesitancy about getting the Covid-19 vaccine. By analyzing Covid-19, US Census, and demographic data, we hope to determine:
- Which demographic factors, such as income and poverty level, employment status, race/ethnicity, and access to transportation, are more likely to contribute to vaccine hesitancy?
- Can we assume that counties that voted for Donald Trump are more likely to have higher populations of individuals who are vaccine hesitant?
- Data sources: US Census Demographic Data; Vaccine Hesitancy for COVID-19; Election, COVID, and Demographic Data by County; Urban-Rural Classification for Counties; County FIPS Codes
- Software: Jupyter Notebook; QuickDBD; pgAdmin 4; Tableau Public; Amazon Web Services
Pandas and NumPy were used to clean the data and perform preliminary exploratory analysis. Further analysis was completed using Python. Seaborn and Matplotlib were used for data exploration and visualization.
PostgreSQL and pgAdmin were used to create and store the database. AWS was used for cloud storage of the database. SQLAlchemy was used to connect to the database and load the data.
Scikit-learn was used to split the data into training and testing sets, and to build and test our machine learning model.
Tableau Public was used to present the data and visualize our findings. Link_to_Tableau_Dashboard
During preprocessing, four databases were joined to create a single file to be used in the machine learning model. Next, the file was converted to a dataframe. Null rows and columns and duplicate rows were then removed from the dataframe. Using NumPy, the estimated hesitancy data was converted from integers to strings, creating a new “hesitancy” column with three categories: “low hesitancy,” “moderate hesitancy,” and “high hesitancy.” This final data was saved as a CSV file and used for the machine learning model.
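The labeling step can be sketched roughly as follows. This is only an illustration, not the project's actual code: the column name `estimated_hesitant` and the toy values are assumptions, and the 15% and 25% cut-offs are taken from the ranges quoted in the findings at the end of this README.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "estimated_hesitant" is an assumed column name
# holding the estimated share of hesitant residents as a percentage.
county_data_df = pd.DataFrame(
    {
        "county": ["A", "B", "C", "D"],
        "estimated_hesitant": [9.0, 15.0, 24.9, 31.0],
    }
)

# Assumed cut-offs: below 15% is low, 15% to 25% is moderate, above 25% is high.
conditions = [
    county_data_df["estimated_hesitant"] < 15,
    county_data_df["estimated_hesitant"] <= 25,
]
labels = ["low hesitancy", "moderate hesitancy"]

# np.select returns the label of the first condition that holds for each row
# and falls back to the default when none does.
county_data_df["hesitancy"] = np.select(
    conditions, labels, default="high hesitancy"
)

print(county_data_df)
```

Because `np.select` checks the conditions in order, the second condition only applies to rows that were not already labeled low.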
Description of preliminary feature engineering and preliminary feature selection, including their decision-making process
Variables were chosen as follows:
Independent variables
X = county_data_df[["percent_white","percent_hispanic", "percent_american_indian_alaska_native",
"percent_asian", "percent_black", "percent_hawaiian_pacific", "Poverty",
"ChildPoverty", "Drive","Carpool", "Transit", "Walk", "OtherTransp",
"WorkAtHome", "PrivateWork", "PublicWork", "SelfEmployed", "FamilyWork", "Unemployment",
"percentage20_Donald_Trump", "percentage20_Joe_Biden", "population_scaled"]]
Dependent Variable
y = county_data_df['hesitancy']
The data was split into training and testing sets using the random state parameter to guarantee that the same sequence of random numbers is generated each time we run the code.
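A minimal sketch of a reproducible split is shown below. The two feature columns are borrowed from the list above, but the values are randomly generated stand-ins, and `random_state=1` is an arbitrary choice rather than the value the group used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data (not the project's county data).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "Poverty": rng.uniform(5, 40, size=100),
        "Unemployment": rng.uniform(2, 15, size=100),
    }
)
y = pd.Series(
    rng.choice(["low hesitancy", "moderate hesitancy", "high hesitancy"], size=100),
    name="hesitancy",
)

# A fixed random_state makes the shuffle, and therefore the split, repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Running the split again with the same random_state selects the same rows.
X_train_again, X_test_again, _, _ = train_test_split(X, y, random_state=1)
print(X_train.index.equals(X_train_again.index))
```

With the default settings, `train_test_split` holds out 25% of the rows for testing.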
After exploring several classification approaches, including multiple logistic regression, naïve random sampling, SMOTE oversampling, undersampling, and a random forest classifier, the group chose the multiple logistic regression model, which yielded 77% accuracy, precision, and recall.
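For readers unfamiliar with the workflow, fitting and scoring a multiclass logistic regression in scikit-learn looks roughly like this. The data is synthetic (generated with `make_classification`), so the scores it prints say nothing about the project's 77% result, and the `max_iter` setting is an assumption rather than the group's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the county features and the
# low / moderate / high hesitancy labels (not the project's data).
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_classes=3, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# max_iter is raised as a precaution so the solver has room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy is the share of correct predictions; the report adds
# per-class precision and recall.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred))
```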
Advantages
- Well suited to categorical outcomes, such as the hesitancy categories
- Easier to train and interpret
- Provides good accuracy
Disadvantages
- Can be prone to overfitting if the number of observations is smaller than the number of features
- Assumes a linear relationship between the features and the log-odds of the outcome, so it handles non-linear data poorly
- Not good for complex relationships
- The model performed well when predicting moderate hesitancy, as expected.
- The model predicted only 1 data point as high hesitancy when it was truly low hesitancy.
- The model predicted only 2 data points as low hesitancy when they were truly high hesitancy.
- Poverty is the most important feature, followed by the percentage of votes for Joe Biden in the 2020 election.
- The third most important feature is the percentage of the county's population that is African American.
- There is a moderate negative correlation between the percentage of votes for Joe Biden (2020) and the percentage of white population in a county.
- There is a weak negative correlation between the percentage of votes for Donald Trump (2020) and the percentages of Asian and African American population in a county.
- There is a significant difference between low and moderate hesitancy at the 95% confidence level.
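Correlations like the ones listed above can be checked with the pandas `corr` method. The sketch below uses two column names from the feature list, but the values are synthetic standardized numbers built to be negatively correlated, not real county percentages, so the coefficient it prints is illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic, standardized stand-in values (not real county percentages),
# constructed so that the two columns are negatively correlated.
rng = np.random.default_rng(42)
n = 500
percent_white = rng.normal(size=n)
biden_share = -0.5 * percent_white + rng.normal(size=n)

df = pd.DataFrame(
    {
        "percent_white": percent_white,
        "percentage20_Joe_Biden": biden_share,
    }
)

# DataFrame.corr computes pairwise Pearson correlation coefficients by default.
corr_matrix = df.corr()
r = corr_matrix.loc["percent_white", "percentage20_Joe_Biden"]
print(corr_matrix)
print(f"Pearson r = {r:.2f}")
```

A coefficient near -1 or 1 indicates a strong linear relationship, and one near 0 a weak one.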
Which demographic factors, such as income and poverty level, employment status, race/ethnicity, and access to transportation, are more likely to contribute to vaccine hesitancy?
- Poverty (economy)
- Percentage of votes for Joe Biden in the 2020 election (political views)
- Percentage of African American population in the county (race)
Can we assume that counties that voted for Donald Trump are more likely to have higher populations of individuals who are vaccine hesitant?
- Yes. Our analysis showed that counties Trump carried in the 2020 Presidential election were more likely to have moderate hesitancy, between 15% and 25% (76% of counties vs. 46% of counties), and less likely to have low hesitancy, below 15% (11% of counties vs. 42% of counties).
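A comparison of this kind can be produced with a row-normalized cross-tabulation. The sketch below uses a handful of made-up rows, and the `trump_carried` column is a hypothetical flag (for example, derived by comparing the two candidates' vote shares), so its output does not reproduce the percentages above.

```python
import pandas as pd

# Made-up rows for illustration; "trump_carried" is a hypothetical flag
# marking counties where Trump received more votes than Biden.
df = pd.DataFrame(
    {
        "trump_carried": [True, True, True, True, False, False, False, False],
        "hesitancy": [
            "moderate hesitancy",
            "moderate hesitancy",
            "moderate hesitancy",
            "low hesitancy",
            "low hesitancy",
            "low hesitancy",
            "moderate hesitancy",
            "high hesitancy",
        ],
    }
)

# normalize="index" turns counts into row shares, so each row sums to 100
# after scaling: the percentage of counties in each hesitancy category.
shares = pd.crosstab(df["trump_carried"], df["hesitancy"], normalize="index") * 100
print(shares.round(1))
```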