This project analyzes a dataset of 11 clinical features for patients, each labeled by whether or not the patient has had a stroke. Its purpose is to derive insight into the characteristics and statistics of these patients, to create a machine learning model that can determine whether a patient is at high risk of having a stroke, and to determine which factors influence whether a patient has had a stroke.
Initial team meeting via Zoom: discussed project requirements and assigned individual responsibilities. Team meetings via Zoom occur twice per week during our scheduled class time. Team communication takes place via Slack as needed to update team members on progress and to ask for assistance.
Caitlin Bishop, Alex Borden, Andrew Carlson, Brandon Castro
Data Source: healthcare-dataset-stroke-data.csv from Kaggle, with credit to the dataset's author, fedesoriano.
Tools: Jupyter Notebook, Visual Studio Code, Python, Pandas, NumPy, Seaborn, Matplotlib, scikit-learn (supervised machine learning binary classification model), PostgreSQL, and Tableau.
Jupyter Notebook, along with Python's Pandas, NumPy, and Seaborn libraries, will be used to clean the data and perform an exploratory/statistical analysis.
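As a rough illustration of the cleaning and summary step, here is a minimal sketch. The rows are made up for the example, the column names are assumed to mirror the Kaggle CSV, and filling missing BMI values with the median is only one possible choice, not necessarily what our notebook does.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; column names are assumed to match
# the Kaggle CSV, and the values are not real patient data.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["Male", "Female", "Female", "Male", "Female"],
    "age": [67.0, 61.0, 80.0, 49.0, 34.0],
    "avg_glucose_level": [228.7, 202.2, 105.9, 171.2, 88.4],
    "bmi": [36.6, np.nan, 32.5, 34.4, np.nan],
    "stroke": [1, 1, 1, 0, 0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate patient ids and fill missing BMI with the median."""
    out = df.drop_duplicates(subset="id").copy()
    out["bmi"] = out["bmi"].fillna(out["bmi"].median())
    return out


cleaned = clean(raw)

# Summary statistics overall, then feature means split by the stroke target.
print(cleaned.describe())
print(cleaned.groupby("stroke")[["age", "avg_glucose_level", "bmi"]].mean())
```

With the real data, `raw` would instead come from `pd.read_csv("healthcare-dataset-stroke-data.csv")`.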
Data will be stored in and queried from a PostgreSQL database.
Python's scikit-learn will be employed to create a supervised machine learning binary classification model from the stroke patient data CSV file. The goal is a model that can determine whether a patient is at high risk of having a stroke based on the patient's characteristics.
Our dashboard will be hosted on Tableau Public, which will be used to create a fully functioning, interactive dashboard and story that visualize and present our data and findings.
Stroke Prediction Analysis Dashboard Link
- Caitlin Bishop: GitHub/Data Cleaning/Exploratory Analysis/Presentation
- Alex Borden: Technology/Dashboard
- Andrew Carlson: Machine Learning Model
- Brandon Castro: SQL-based Database
Link to Presentation on Google Slides
- Selected topic
  - Stroke Prediction Analysis
- Reason the topic was selected
  - Stroke prediction was chosen because of our shared background and interest in the healthcare field.
- Description of the source of data
  - The data contains 11 clinical features regarding medical patients: patient id, gender, age, hypertension status, heart disease status, marital status, employment type, residence type, average glucose level, body mass index (BMI), and smoking status. There is also a target vector that states whether or not a given patient has had a stroke.
- Questions we hope to answer with the data
  - Can the classification model determine whether or not a patient could have a stroke?
  - Which factors most strongly influence whether or not a stroke occurs?
  - Through our analysis, can we find who is most susceptible to having a stroke?
- README.md
- Description of the communication protocols
Plan for storing data in a PostgreSQL database:
- Create a table in pgAdmin 4 into which the CSV file will be uploaded.
- Create two other tables from the main table, one for the biological characteristics of patients and one for their demographic characteristics.
- Perform queries to gather statistical insight on the data.
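The three steps above can be sketched end to end as follows. To keep the example runnable without a database server, it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table names, column names, the split of columns between the two derived tables, and the sample rows are all assumptions made for illustration, not the project's actual schema.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Step 1: main table that the CSV rows would be loaded into (assumed schema).
cur.execute("""
    CREATE TABLE stroke_data (
        id INTEGER PRIMARY KEY, gender TEXT, age REAL,
        hypertension INTEGER, heart_disease INTEGER, ever_married TEXT,
        work_type TEXT, residence_type TEXT, avg_glucose_level REAL,
        bmi REAL, smoking_status TEXT, stroke INTEGER
    )
""")

# Made-up rows, not real patient data.
rows = [
    (1, "Male", 67.0, 0, 1, "Yes", "Private", "Urban", 228.7, 36.6, "formerly smoked", 1),
    (2, "Female", 49.0, 0, 0, "Yes", "Private", "Urban", 171.2, 34.4, "smokes", 0),
    (3, "Female", 79.0, 1, 0, "Yes", "Self-employed", "Rural", 174.1, 24.0, "never smoked", 1),
    (4, "Male", 30.0, 0, 0, "No", "Private", "Rural", 90.0, 22.0, "never smoked", 0),
]
cur.executemany("INSERT INTO stroke_data VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", rows)

# Step 2: derive the two feature tables from the main table.
cur.execute("""
    CREATE TABLE biological_features AS
    SELECT id, hypertension, heart_disease, avg_glucose_level, bmi, stroke
    FROM stroke_data
""")
cur.execute("""
    CREATE TABLE demographic_features AS
    SELECT id, gender, age, ever_married, work_type, residence_type, smoking_status
    FROM stroke_data
""")

# Step 3: a statistical query joining both tables on the patient id.
cur.execute("""
    SELECT b.stroke, COUNT(*) AS patients, AVG(d.age) AS avg_age
    FROM biological_features AS b
    JOIN demographic_features AS d ON b.id = d.id
    GROUP BY b.stroke
    ORDER BY b.stroke
""")
summary = cur.fetchall()
print(summary)
conn.close()
```

The `CREATE TABLE ... AS SELECT` and `JOIN` statements use syntax that PostgreSQL also accepts, so the same queries could be run in pgAdmin 4 once the main table exists.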
The CSV dataset will be read in as a Pandas DataFrame and used for the machine learning model. The model's output will be a prediction of whether or not the patient had a stroke. As mentioned above, the goal is a model that can determine whether a patient is at high risk of having a stroke based on the patient's features in the dataset. If the output for a patient states that they had a stroke, then the patient may be at high risk of having a stroke according to their features.
See the gradient_boosting_model.ipynb file in the machine_learning folder for a description of data preprocessing, feature engineering, feature selection, data splitting for training/testing of model, and gradient boosting model creation.
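For readers who want a feel for that workflow without opening the notebook, here is a minimal sketch of encoding, splitting, and fitting a gradient boosting classifier with scikit-learn. The data is synthetic, and the column names, target rule, split ratio, and default hyperparameters are assumptions for illustration, not the notebook's actual choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the stroke data (not real patient data).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.uniform(20, 90, n),
    "avg_glucose_level": rng.uniform(60, 250, n),
    "bmi": rng.uniform(16, 45, n),
    "gender": rng.choice(["Male", "Female"], n),
    "smoking_status": rng.choice(["never smoked", "smokes", "formerly smoked"], n),
})
# Invented target rule: older synthetic patients are labeled as stroke cases.
df["stroke"] = (df["age"] + rng.normal(0, 10, n) > 70).astype(int)

# Preprocessing: one-hot encode the categorical columns.
X = pd.get_dummies(df.drop(columns="stroke"), dtype=float)
y = df["stroke"]

# Stratified split so both classes appear in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
```

Because stroke cases are typically much rarer than non-stroke cases, per-class precision and recall from `classification_report` are usually more informative than accuracy alone.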
The CSV file was imported into a PostgreSQL database as a table using the following SQL query:
Below is a portion of the resulting table:
We will be utilizing Tableau Public to create a story-based dashboard in combination with an interactive dashboard.
Stroke Prediction Analysis Story Link
Here is a sneak peek of the story points we will be using inside Tableau.
Our interactive dashboard created in Tableau includes the following 8 views for identifying trends in the stroke dataset:
- Averages
- Age & Stroke
- Gender & Work Type
- Heart Disease and Hypertension
- Impact of Marriage
- Impact of Residence Type
- Impact of Smoking Status
- BMI & Glucose Calculators
This dashboard is fully functional and includes a Gender & Work Type bar chart that identifies trends in stroke predictions for males and females.
Stroke Prediction Analysis Dashboard Link
See the gradient_boosting_model.ipynb file in the machine_learning folder for optimization methods used for the model, results after optimization, determination of feature importances, and conclusion of the machine learning analysis.
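As a hedged illustration of what optimization and feature-importance steps can look like, the sketch below tunes a gradient boosting model with `GridSearchCV` and then ranks features by `feature_importances_`. The notebook's actual optimization methods are not restated here, so the grid search, the parameter grid, the recall scoring, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic numeric features (not real patient data).
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "age": rng.uniform(20, 90, n),
    "avg_glucose_level": rng.uniform(60, 250, n),
    "bmi": rng.uniform(16, 45, n),
})
# Invented target rule driven by age only, so age should rank first.
y = (X["age"] + rng.normal(0, 8, n) > 65).astype(int)

# Small, assumed hyperparameter grid; a real search would likely be wider.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="recall",  # assumed metric: favors catching true stroke cases
    cv=3,
)
search.fit(X, y)

# Rank features by the tuned model's impurity-based importances.
best_model = search.best_estimator_
importances = pd.Series(
    best_model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(search.best_params_)
print(importances)
```

Impurity-based importances describe how the fitted model uses each feature, not a causal effect, so they are best read alongside the exploratory analysis.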
Finalized the Dashboard/Story and added Exploratory Analysis and Machine Learning to show importance within the Tableau story.
SQL script for creating the tables: ERD_DB_creation.sql
Below are portions of the resulting tables.
Biological Features Table:
Demographic Features Table:
Final Presentation