In this project, we examine a dataset of contraceptive methods used by married Indonesian women to see what insights and predictions can be made based on various demographic and socioeconomic variables.
Our goal is to make informed recommendations to public health officials for further action to improve the welfare of mothers and children. We hope to help empower the healthcare autonomy of women in Indonesia while reducing unwanted pregnancies, abortion rates, birth complications, and infant and maternal mortality rates.
Our dataset comes from the UC Irvine Machine Learning Repository, used with permission. Note that the full datasets are not included in this repository as they require additional permission to publish.
Feature Name | Type | Values |
---|---|---|
Woman's age | numerical | |
Woman's education | categorical | 1 = low, 2, 3, 4 = high |
Husband's education | categorical | 1 = low, 2, 3, 4 = high |
Number of children | numerical | |
Wife's religion | binary | 0 = Non-Islam, 1 = Islam |
Wife's now working? | binary | 0 = Yes, 1 = No |
Husband's occupation | categorical | 1, 2, 3, 4 |
Standard-of-living index | categorical | 1 = low, 2, 3, 4 = high |
Media exposure | binary | 0 = Good, 1 = Not good |
Feature Name | Type | Values |
---|---|---|
Contraceptive method used | class attribute | 1 = No-use, 2 = Long-term method, 3 = Short-term method |
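As a rough illustration of how the coded features and target above could be loaded, here is a minimal sketch using pandas. The column names are our own labels (the raw UCI file has no header row), the column order is assumed to follow the tables above, and the four sample rows are made up for demonstration; they are not taken from the survey.

```python
import io

import pandas as pd

# Hypothetical column labels, in the same order as the feature tables above.
COLUMNS = [
    "wife_age", "wife_education", "husband_education", "num_children",
    "wife_religion", "wife_not_working", "husband_occupation",
    "living_standard", "poor_media_exposure", "contraceptive_method",
]

# Made-up rows in the headerless, comma-separated layout assumed here.
SAMPLE = """\
24,2,3,3,1,1,2,3,0,1
45,1,3,10,1,1,3,4,0,1
30,4,4,2,0,0,1,4,0,2
27,3,3,1,1,1,2,2,1,3
"""


def load_cmc(source):
    """Read the headerless file and check the coded values against the tables."""
    df = pd.read_csv(source, header=None, names=COLUMNS)
    # Ordinal features are coded 1-4, binary features 0/1, target 1-3.
    for col in ["wife_education", "husband_education",
                "husband_occupation", "living_standard"]:
        assert df[col].between(1, 4).all(), f"unexpected code in {col}"
    for col in ["wife_religion", "wife_not_working", "poor_media_exposure"]:
        assert df[col].isin([0, 1]).all(), f"unexpected code in {col}"
    assert df["contraceptive_method"].isin([1, 2, 3]).all()
    return df


df = load_cmc(io.StringIO(SAMPLE))
print(df["contraceptive_method"].value_counts().sort_index())
```

Naming the two inversely coded binaries `wife_not_working` and `poor_media_exposure` keeps the meaning of a 1 visible when reading correlations later.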
Our methodology implements the CRISP-DM model for exploratory data analysis, cleaning, modeling, and evaluation.
We leverage machine learning models from Scikit-learn to determine the relationship between the features and the target variable. We also perform statistical analysis via SciPy Stats to further make inferences on the data.
Other tools used include Python, NumPy, and Pandas. Visualizations were created with Matplotlib and Seaborn.
After verifying data integrity for our 1987 data, we moved directly into EDA and data visualization. We examined how each independent variable relates to the target variable. Based on our findings, we engineered new variables and collapsed the contraceptive method into a binary variable (0 meaning no contraception, 1 meaning a long-term or short-term method). We performed regression analysis to determine the relationships among the variables. We then tested numerous classifiers to determine our best model, correcting for class imbalance and optimizing for recall. As an extension, we ran our fitted classifiers against the 2017 data to further evaluate our model.
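The binarized target and the two engineered features discussed in the findings could be derived roughly as follows. This is a sketch under stated assumptions: the column names are hypothetical, and the exact rules (lowest education level plus not working for the NEET flag, a strict comparison for "more educated") are our guesses at reasonable definitions, not necessarily the ones used in the notebooks.

```python
import pandas as pd


def engineer_features(df):
    """Add a binary target and two illustrative engineered features."""
    out = df.copy()
    # 1 = No-use in the original coding; methods 2 and 3 both count as use.
    out["uses_contraception"] = (out["contraceptive_method"] != 1).astype(int)
    # Assumed rule: lowest education level (1) and not working (coded 1).
    out["neet_wife"] = (
        (out["wife_education"] == 1) & (out["wife_not_working"] == 1)
    ).astype(int)
    # Assumed rule: wife's education level strictly above the husband's.
    out["wife_more_educated"] = (
        out["wife_education"] > out["husband_education"]
    ).astype(int)
    return out


# Tiny made-up frame, for demonstration only.
toy = pd.DataFrame({
    "wife_education": [1, 4, 2],
    "husband_education": [3, 2, 2],
    "wife_not_working": [1, 0, 1],
    "contraceptive_method": [1, 2, 3],
})
features = engineer_features(toy)
print(features[["uses_contraception", "neet_wife", "wife_more_educated"]])
```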
The 1987 test data show a slight skew toward younger and older women among those who do not use any contraception. We hypothesize that younger women may not yet have started using contraception, and that older women were not exposed to contraceptive methods at an early enough age:
Perhaps unsurprisingly, the number of children a woman has is also correlated with contraception use, but only up to a certain point:
The biggest predictor of contraception use is the Level of Education attained by the wife:
Similarly, the wife's Employment Status, Husband's Education Level, Standard of Living, and Media Exposure also had a positive correlation (note that the Media Exposure variable is defined inversely: 1 is poor exposure):
Religion had a negative impact on use of contraception, which is not surprising:
One of our engineered features - NEET Wife (Not in Education, Employment, or Training) - also showed a negative impact, as expected:
Our other engineered feature - Wife More Educated (than the husband) - showed a positive impact:
We tested Logistic Regression, Random Forest, and XGBoost; tested dummying the children variable; tested Tomek Links resampling; and tuned hyperparameters via GridSearchCV.
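To illustrate the tuning step, here is a minimal sketch of a recall-scored grid search with scikit-learn. It is not the project's actual pipeline: it uses synthetic data from `make_classification`, a Logistic Regression instead of the XGBoost model, and `class_weight="balanced"` as a stand-in for Tomek Links resampling. The parameter grid is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, mildly imbalanced binary data (arbitrary class weights).
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=5,
    weights=[0.4, 0.6], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Arbitrary illustrative grid; class_weight stands in for resampling.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

# scoring="recall" makes the search select the candidate with the best
# cross-validated recall on the positive class.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

test_recall = recall_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_recall, 3))
```

Any estimator that follows the scikit-learn interface can be dropped into `GridSearchCV` in place of the Logistic Regression, with a parameter grid matching that estimator.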
Our best model uses an XGBClassifier:
Running against the cleaned and prepared 2017 data, we were able to obtain acceptable results:
Our best model performed well, with a recall of 0.79, F1 score of 0.79, and ROC-AUC of 0.78.
Testing our trained classifier on 2017 data yielded promising results: recall of 0.65, F1 score of 0.65, and ROC-AUC of 0.69.
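For reference, the three reported metrics can be computed with scikit-learn as sketched below. The labels and scores are made up, and the 0.5 decision threshold and the positive-class (binary) averaging are assumptions; the averaging behind the figures above is not stated here and may differ.

```python
from sklearn.metrics import f1_score, recall_score, roc_auc_score


def evaluate(y_true, y_score, threshold=0.5):
    """Return recall, F1, and ROC-AUC for binary labels and predicted scores."""
    # Hard predictions are needed for recall and F1; ROC-AUC uses raw scores.
    y_pred = [int(score >= threshold) for score in y_score]
    return {
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }


# Made-up labels and predicted probabilities, for demonstration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.35]

metrics = evaluate(y_true, y_score)
print({name: round(value, 3) for name, value in metrics.items()})
```

The same function can be applied to a model trained on one survey year and scored on another, which is how a drop such as the one between the 1987 and 2017 results would show up.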
Our analysis also suggests that a woman’s education and media exposure are strong indicators of her likelihood of using contraception. It would be beneficial to direct public health initiatives around these factors.
After submitting an application, we have received authorization from the Demographic and Health Surveys program (DHS) to access the original NICPS full datasets, which include more attributes and much larger sample sizes.
The original datasets will allow us to find deeper insights, as well as test our model on the population of Indonesia in subsequent years.
We would like to explore what other targets can be predicted in order to make further public health policy suggestions.
We would like to further optimize our model for a higher average recall, and continue to clean, implement features from, and predict on data from subsequent years, as well as evaluate against populations of other countries.
.
├── code/                                        # python helper functions file
│   └── functions.py                             # helper functions
├── crisp_dm_process/                            # initial EDA and model notebook files
│   ├── Initial_EDA.ipynb                        # notebook file with data exploration, insights, and takeaways
│   ├── model_fitting_and_tuning.ipynb           # notebook file for modeling trials and process
│   └── model_predict_1987_vs_2017.ipynb         # notebook file for extended analysis on fitting our trained models on 2017 data
├── data/                                        # project datasets
├── images/                                      # visualizations; images for notebooks, README, and presentation slides
├── Contraception_Indonesia_FINAL.ipynb          # primary project notebook
├── Contraception_Indonesia_Presentation.pdf     # presentation slides
├── README.md                                    # this readme
└── relevant_resources/                          # additional resources for 1987 data
- Dataset Origin: the 1987 National Indonesia Contraceptive Prevalence Survey
- Creator: Tjen-Sien Lim (limt@stat.wisc.edu)
- Donor: Tjen-Sien Lim (limt@stat.wisc.edu)
- Date: June 7, 1997
- Web Source: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice