Utilize the real-life disaster datasets provided by Figure Eight to perform the following:
- Build an ETL pipeline to clean the data.
- Build a supervised learning model to categorize messages, using scikit-learn `Pipeline` and `FeatureUnion`.
- Build a Flask web app to:
  - Take a user input message and get the classification results for 36 categories (e.g. `aid_related`, `weather_related`, `direct_report`, etc.).
  - Display visualizations to describe the datasets.
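The model described above can be sketched as a scikit-learn `Pipeline` wrapping a `FeatureUnion`. This is a minimal illustration with toy data and only a text-feature branch, not the project's actual code; the feature names and estimator choices here are assumptions.

```python
# Illustrative sketch of a Pipeline + FeatureUnion multi-output classifier.
# The feature branch names and the RandomForest choice are assumptions,
# not the project's actual configuration.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

pipeline = Pipeline([
    ("features", FeatureUnion([
        # A single text branch; the real project may union several branches
        ("text", Pipeline([
            ("vect", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
        ])),
    ])),
    # MultiOutputClassifier fits one classifier per category column
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10))),
])

# Tiny toy data: four messages, two binary category labels each
X = ["we need water and food", "storm flooded the streets",
     "send medical aid", "heavy rain and wind"]
y = [[1, 0], [0, 1], [1, 0], [0, 1]]
pipeline.fit(X, y)
print(pipeline.predict(["please send food"]).shape)  # (1, 2)
```

In the real project the label matrix `y` would have 36 columns, one per category.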
In addition to the Anaconda distribution of Python (versions 3.*), the following libraries need to be installed manually.
- Navigate to the project's root directory.
- To run the ETL pipeline that cleans the data and stores it in a database:

  ```
  python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/DisasterResponse.db
  ```

- To run the ML pipeline that trains the classifier and saves it:

  ```
  python models/train_classifier.py data/DisasterResponse.db models/classifier.pkl
  ```

- To run the web app:

  ```
  cd app
  python run.py
  ```

- Finally, go to http://0.0.0.0:3001/ to find visualizations describing the dataset. You can also input a text message to see which categories it falls into, according to the trained model's prediction.
- `data/` directory includes the ETL script `process_data.py` to clean the data and store it in a SQLite database.
- `models/` directory includes the ML script `train_classifier.py` to train a classifier and store it in a pickle file.
- `app/` directory includes the web app `run.py` to handle visualizations using data from the SQLite database.
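The core transformation in an ETL script like `process_data.py` is unpacking the raw `categories` column, which packs all labels into one semicolon-separated string. This is a hedged sketch with inline sample data; the exact column format (`name-0`/`name-1` pairs) is an assumption about the raw CSV, and only three of the 36 categories are shown.

```python
# Hedged sketch of splitting a packed categories column into one binary
# column per category. The "name-0;name-1;..." format is an assumption.
import pandas as pd

raw = pd.DataFrame({
    "id": [1, 2],
    "message": ["we need water", "storm coming"],
    "categories": ["related-1;request-1;offer-0",
                   "related-1;request-0;offer-0"],
})

# Split the packed string into one column per category
categories = raw["categories"].str.split(";", expand=True)
# Derive column names from the first row ("related-1" -> "related")
categories.columns = [c.split("-")[0] for c in categories.iloc[0]]
# Keep only the trailing 0/1 digit and convert to int
for col in categories.columns:
    categories[col] = categories[col].str[-1].astype(int)

clean = pd.concat([raw.drop(columns="categories"), categories], axis=1)
print(clean.columns.tolist())  # ['id', 'message', 'related', 'request', 'offer']
```

The real script would then write `clean` to SQLite, e.g. via `DataFrame.to_sql` with an SQLAlchemy engine.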
As the screenshot above shows, this dataset is imbalanced. For reference, this Google document defines the degree of imbalance as "extreme" when the "proportion of minority class [is] <1% of the data set".
The side effect could be that the model spends most of its training time on majority examples and does not learn enough from the minority ones.
Typical approaches to tackle this issue are:
- Downsampling: use a subset of the majority class samples.
- Oversampling: increase the minority class samples (usually synthetically).
Libraries like imbalanced-learn provide implementations of these techniques, including Synthetic Minority Oversampling Technique (SMOTE).
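To make the idea concrete, here is plain random oversampling sketched in numpy. Note this is the simpler technique, not SMOTE itself: SMOTE synthesizes new minority samples by interpolating between neighbors, whereas this merely repeats existing ones.

```python
# Random oversampling of the minority class, sketched with numpy only.
# This repeats existing minority samples; SMOTE would synthesize new ones.
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)     # 10 samples, 2 features
y = np.array([0] * 8 + [1] * 2)      # imbalanced: 8 majority, 2 minority

minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

# Draw minority indices with replacement until the classes are balanced
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])

X_bal, y_bal = X[idx], y[idx]
print(np.bincount(y_bal))  # [8 8]
```

With imbalanced-learn, the equivalent call would be `SMOTE().fit_resample(X, y)`, which returns a balanced dataset in the same shape.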
- Figure Eight for preparing the datasets
- Udacity Data Science Nanodegree