- The notebook in the main file uses the environment found in environment.yml
- The notebook used for mapping in the geopandas file uses the environment found in geoenvironment.yml
IHH Humanitarian Relief Foundation is an NGO that provides and maintains water wells in areas where clean water is inaccessible. An accurate classification model that predicts whether a pump is functional or in need of repair will help streamline their operations. These predictions will help the organization target its maintenance work and keep clean, potable water available to the people of Tanzania.
Tanzania has an area of 364,900 mi², which makes it the 30th largest country in the world. According to trade.gov, the managed national road network consists of 21,058 miles of roadway, comprising 7,944 miles of trunk roads and 13,114 miles of regional roads. Due to the size of the country, the condition of its infrastructure, and a limited budget of both capital and manpower, it is important to deploy maintenance and repair efforts only to the locations that actually need repair.
Constructing a well in Tanzania can cost upwards of $10,000 (The Living Water Project) depending on factors such as:
- cost of goods and labor
- necessary drilling depth
- amount of rock to drill through
- location of well
- cost of fuel (drilling and transportation)
Pumps in wells generally last 10 or more years, but their parts have a finite life span. The cost of repairing a well can range from a few hundred dollars to several thousand dollars. Sending a repair crew to a well that is predicted to need repairs but is in fact functioning (a false positive) uses costly resources that could be put toward the wells that actually need repair.
Data for this project is from Taarifa and the Tanzanian Ministry of Water.
This section houses functions and classes created to help later on in the notebook.
This section houses pipelines and transformers created to help later on in the notebook.
Length of DataFrame: 59400
amount_tsh - Total static head (amount of water available to the waterpoint)
date_recorded - The date the row was entered
funder - Who funded the well
gps_height - Altitude of the well
installer - Organization that installed the well
longitude - GPS coordinate
latitude - GPS coordinate
wpt_name - Name of the waterpoint if there is one
num_private -
basin - Geographic water basin
subvillage - Geographic location
region - Geographic location
region_code - Geographic location (coded)
district_code - Geographic location (coded)
lga - Geographic location
ward - Geographic location
population - Population around the well
public_meeting - True/False
recorded_by - Group entering this row of data
scheme_management - Who operates the waterpoint
scheme_name - Who operates the waterpoint
permit - If the waterpoint is permitted
construction_year - Year the waterpoint was constructed
extraction_type - The kind of extraction the waterpoint uses
extraction_type_group - The kind of extraction the waterpoint uses
extraction_type_class - The kind of extraction the waterpoint uses
management - How the waterpoint is managed
management_group - How the waterpoint is managed
payment - What the water costs
payment_type - What the water costs
water_quality - The quality of the water
quality_group - The quality of the water
quantity - The quantity of water
quantity_group - The quantity of water
source - The source of the water
source_type - The source of the water
source_class - The source of the water
waterpoint_type - The kind of waterpoint
waterpoint_type_group - The kind of waterpoint
The original DataFrame did not include the target column, 'status_group', so the original data and the target data needed to be joined.
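The join described above might look like the following minimal sketch. It uses tiny in-memory stand-ins for the two training files, and it assumes both share an `id` column that uniquely identifies a waterpoint; that key name is an assumption, not something this README states.

```python
import pandas as pd

# Tiny stand-ins for data/training_set_values.csv and data/training_set_labels.csv.
values = pd.DataFrame({
    "id": [1, 2, 3],
    "gps_height": [1390, 1399, 686],
    "basin": ["Lake Nyasa", "Lake Victoria", "Pangani"],
})
labels = pd.DataFrame({
    "id": [3, 1, 2],
    "status_group": ["functional", "functional", "non functional"],
})

# Assumption: "id" is the shared key. validate="one_to_one" raises a
# MergeError if either frame contains duplicate ids, which guards
# against silently duplicating rows during the join.
df = values.merge(labels, on="id", how="inner", validate="one_to_one")
print(df)
```

With the real files, the two frames would come from `pd.read_csv` instead of being built by hand.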
Each column with null values was assessed to see whether its missing values could be imputed or whether the column needed to be dropped. The only column of concern was 'scheme_name', where 47.42% of values were missing.
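A per-column missing-value percentage like the one quoted above can be computed roughly as follows. This is a sketch on a toy frame, so the percentages it prints are illustrative and are not the project's figures.

```python
import pandas as pd

# Toy frame with a few gaps; column names are borrowed from the dataset,
# the values are made up for illustration.
df = pd.DataFrame({
    "scheme_name": ["Roman", None, None, "K"],
    "permit": [True, False, None, True],
    "basin": ["Pangani", "Rufiji", "Pangani", "Internal"],
})

# isna() marks missing cells; the column-wise mean of those booleans is
# the fraction missing, which is then scaled to a percentage.
missing_pct = (df.isna().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])
```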
'date_recorded' - not relevant to model
'wpt_name' - not relevant to model
'num_private' - not relevant to model
'subvillage' - not relevant to model
'region_code' - similar info to region
'district_code' - similar info
'lga' - similar info
'recorded_by' - not relevant to model
'scheme_name' - not relevant to model
'funder' - not relevant to model
'extraction_type_group' - duplicate info
'extraction_type_class' - duplicate info
'management_group' - duplicate info
'payment_type' - duplicate info
'quality_group' - duplicate info
'quantity_group' - duplicate info
'source_type' - duplicate info
'source_class' - duplicate info
'waterpoint_type_group' - duplicate info
'status_group' - this is the target
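Assuming the list above is the set of columns removed before modeling, with 'status_group' separated out as the target, the split could be sketched like this. The toy frame holds only a few of the real columns, so `errors="ignore"` is used to skip names that are absent.

```python
import pandas as pd

# Columns listed above as not relevant, similar, or duplicate.
DROP_COLS = [
    "date_recorded", "wpt_name", "num_private", "subvillage",
    "region_code", "district_code", "lga", "recorded_by",
    "scheme_name", "funder", "extraction_type_group",
    "extraction_type_class", "management_group", "payment_type",
    "quality_group", "quantity_group", "source_type", "source_class",
    "waterpoint_type_group",
]

# Toy frame with made-up values, for illustration only.
df = pd.DataFrame({
    "region": ["Iringa", "Mara"],
    "region_code": [11, 20],
    "quantity": ["enough", "insufficient"],
    "quantity_group": ["enough", "insufficient"],
    "status_group": ["functional", "non functional"],
})

# Features: everything except the dropped columns and the target.
X = df.drop(columns=DROP_COLS + ["status_group"], errors="ignore")
# Target: the label column on its own.
y = df["status_group"]
print(X.columns.tolist())
```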
Converting the target to two classes (functional vs. needs repair) turns this into a binary classification problem, which reduces the complexity of the models.
The chart below shows how many wells fall into each status group.
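One way to collapse the three original status groups into a binary target is sketched below. The encoding (1 = needs repair, 0 = functional) is an assumption made for illustration; the notebook may label the classes differently.

```python
import pandas as pd

status = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"],
    name="status_group",
)

# Assumption: anything other than "functional" counts as needing repair.
# 1 = needs repair (positive class), 0 = functional.
needs_repair = (status != "functional").astype(int)
print(needs_repair.tolist())
```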
The histplot below shows us how many wells are functional and how many need repair. The counts are binned based on construction year.
The chart below breaks down the type of waterpoints by whether they are functional or in need of repair.
Map of Tanzania with Location of Wells by Status (plotted in maps.ipynb in geopandas folder)
The baseline model, which predicts every well as functional, performed at ~54% ± 0.00007 accuracy.
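A majority-class baseline of this kind, scored as a cross-validated mean with a standard deviation, can be sketched with scikit-learn's `DummyClassifier`. The data below is synthetic (about 54% of labels in the "functional" class), and the 5-fold setting is an assumption rather than the notebook's actual configuration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))       # features are ignored by the dummy model
y = np.array([0] * 540 + [1] * 460)  # synthetic labels: 0 = functional, 1 = needs repair

# "most_frequent" always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")

# For classifiers, an integer cv uses stratified folds, so every fold
# keeps the same class balance as the full label vector.
scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")
print(f"{scores.mean():.3%} ± {scores.std():.5f}")
```

Any real model should beat this number by a clear margin before its accuracy is worth reporting.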
The first simple model performed at 79.059% ± 0.00484 accuracy.
The Decision Tree Classifier model performed at 78.474% ± 0.00606 accuracy.
The Random Forest Classifier model performed at 82.144% ± 0.00505 accuracy.
The gradient boosting model performed at 76.658% ± 0.00781 accuracy.
The XGBoost model performed at 80.173% ± 0.00607 accuracy.
- Logistic regression is faster to train than other models
- Logistic regression is more interpretable, which makes it more useful for the non-technical presentation
- Logistic regression is less prone to overfitting
GridSearchCV is used to find the best hyperparameters for the chosen logistic regression model.
- Best Strategy for imputing numericals: mean
- Best C value: 1.0
- Best penalty: l2
- Best solver: liblinear
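The search over the hyperparameters listed above could be set up roughly as follows. This is a sketch on synthetic data: the step names (`imputer`, `scaler`, `clf`), the candidate values in the grid, and the 5-fold setting are all assumptions, not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data with roughly 5% of cells set to NaN.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Hypothetical pipeline: impute, scale, then fit logistic regression.
pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys follow the "<step name>__<parameter>" convention.
# liblinear supports both l1 and l2 penalties.
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "clf__C": [0.1, 1.0, 10.0],
    "clf__penalty": ["l1", "l2"],
    "clf__solver": ["liblinear"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
print(f"{search.best_score_:.3f}")
```

Putting the imputer inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the imputed values.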
Many of the models performed similarly, but I chose logistic regression because it is faster to train, more interpretable, and less prone to overfitting. The final logistic regression model performs at ~79.6% accuracy.
Is there a benefit to knowing specifically which wells need repairs but are functioning as opposed to the wells that are not functioning?
Given more time, I would like to create a model to predict each status group originally given ('functional', 'non functional', and 'functional needs repair') instead of converting the target to a binary outcome.
What are the limiting factors in getting resources to the wells that need repairs?
- Maintenance Professionals?
- Time?
- Money?
- Parts?
- Knowledge?
├── data
│ ├── SubmissionFormat.csv
│ ├── test_set_values.csv
│ ├── training_set_labels.csv
│ └── training_set_values.csv
├── geopandas
│ ├── geoenvironment.yml
│ └── maps.ipynb
├── images
│ ├── handsatwell.jpeg
│ ├── location_needrepairs.png
│ ├── location_status_wells_tanzania.png
│ ├── roc_auc_model_compare.png
│ ├── tanzania_flag.jpg
│ ├── waterpoint_types.png
│ ├── well_status_binary.png
│ └── well_status_byyear_binary.png
├── main
│ ├── environment.yml
│ └── index.ipynb
├── .gitignore
└── README.md