
A collaborative machine learning project between Alexander Xin and Mike Flanagan


mike-flanagan/contraceptive-use-in-indonesia


Contraceptive Method Choices of Married Women in Indonesia

Overview

In this project, we examine a dataset of contraceptive methods used by married Indonesian women to see what insights and predictions can be made based on various demographic and socioeconomic variables.

family_planning

Motivation

Our goal is to make informed recommendations for further action to public health officials in order to improve the welfare of mothers and children. We hope to help empower the healthcare autonomy of women in Indonesia while reducing unwanted pregnancies, abortion rates, birth complications, and infant and maternal mortality rates.

Data

Our dataset comes from the UC Irvine Machine Learning Repository, used with permission. Note that the full datasets are not included in this repository as they require additional permission to publish.

Features Examined

| Feature Name | Type | Values |
| --- | --- | --- |
| Woman's age | numerical | |
| Woman's education | categorical | 1 = low, 2, 3, 4 = high |
| Husband's education | categorical | 1 = low, 2, 3, 4 = high |
| Number of children | numerical | |
| Wife's religion | binary | 0 = Non-Islam, 1 = Islam |
| Wife's now working? | binary | 0 = Yes, 1 = No |
| Husband's occupation | categorical | 1, 2, 3, 4 |
| Standard-of-living index | categorical | 1 = low, 2, 3, 4 = high |
| Media exposure | binary | 0 = Good, 1 = Not good |

Target Variable

| Feature Name | Type | Values |
| --- | --- | --- |
| Contraceptive method used | class attribute | 1 = No-use, 2 = Long-term method, 3 = Short-term method |

Methods

Our methodology follows the CRISP-DM process model, covering exploratory data analysis, cleaning, modeling, and evaluation.
We use machine learning models from scikit-learn to determine the relationship between the features and the target variable. We also perform statistical analysis with SciPy Stats to draw further inferences from the data.
Other tools used include Python, NumPy, and pandas. Visualizations were created with Matplotlib and Seaborn.
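
As one illustration of the kind of test SciPy Stats supports, the sketch below runs a chi-square test of independence between a categorical feature and a binary outcome. The counts are made-up placeholders rather than values from the project data, and the choice of test is an assumption; the notebooks may use different tests.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table (placeholder counts, NOT project results):
# rows = education level 1..4, columns = [no contraception, uses contraception]
observed = np.array([
    [30, 10],
    [40, 30],
    [35, 45],
    [25, 60],
])

# Tests whether the row variable and column variable are independent.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
```

A small p-value would suggest that contraception use differs across the levels of the feature, though it says nothing about the direction or size of the effect.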

Approach

After verifying data integrity for our 1987 data, we moved directly into EDA and data visualization. We examined how each independent variable relates to the target variable. Based on our findings, we engineered new variables and collapsed the contraception method into a binary variable (0 = no contraception, 1 = long-term or short-term method). We performed regression analysis to determine the relationships between variables. We then tested several classifiers to find our best model, correcting for class imbalance and optimizing for recall. As an extension, we ran our fitted classifiers against the 2017 data to further evaluate our model.
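
The binary target described above could be derived along these lines. This is a minimal sketch: the column names `contraceptive_method` and `uses_contraception` are hypothetical, and the five rows are illustrative values, not survey records.

```python
import pandas as pd

# Hypothetical column name; the original target is coded
# 1 = No-use, 2 = Long-term method, 3 = Short-term method.
df = pd.DataFrame({"contraceptive_method": [1, 2, 3, 1, 3]})

# 0 = no contraception, 1 = any method (long-term or short-term)
df["uses_contraception"] = (df["contraceptive_method"] != 1).astype(int)

print(df["uses_contraception"].tolist())
```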

Analysis

The 1987 test data show a slight skew toward younger and older women among those who do not use any contraception. We hypothesize that younger women may not have started using contraception yet, and that older women were not exposed to contraceptive methods at an early enough age:
age_distribution age_binary

Perhaps unsurprisingly, the number of children a woman has is also correlated with contraception use, but only up to a certain point:
number_of_children

The biggest predictor of contraception use is the Level of Education attained by the wife:
education_level_1234
education level

Similarly, the wife's Employment Status, Husband's Education Level, Standard of Living, and Media Exposure were also positively correlated with contraception use (note that the Media Exposure variable is defined inversely: 1 means poor exposure):
is_employed husband_education_level standard_of_living media_exposure

Religion had a negative impact on use of contraception, which is not surprising:
religion

One of our engineered features, NEET Wife (Not Educated, Employed or in Training), also showed a negative impact, as expected:
neet

Our other engineered feature - Wife More Educated (than the husband) - showed a positive impact:
wife_more_educated
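
The two engineered features might be built roughly as follows. The column names are hypothetical, the rows are illustrative, and the NEET rule shown here (lowest education level and not currently working) is an assumption about how "not educated" was operationalized; the notebooks hold the actual definition.

```python
import pandas as pd

df = pd.DataFrame({
    "wife_education":    [1, 4, 1, 3],
    "husband_education": [2, 3, 1, 3],
    "wife_not_working":  [1, 0, 0, 1],  # dataset coding: 0 = working, 1 = not working
})

# Assumed NEET rule: lowest education level AND not currently working.
df["neet_wife"] = (
    (df["wife_education"] == 1) & (df["wife_not_working"] == 1)
).astype(int)

# 1 when the wife's education level is strictly higher than the husband's.
df["wife_more_educated"] = (
    df["wife_education"] > df["husband_education"]
).astype(int)

print(df[["neet_wife", "wife_more_educated"]])
```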

Modeling

We tested Logistic Regression, Random Forest, and XGBoost; tested dummying the children variable; tested Tomek Links resampling; and tuned hyperparameters via GridSearchCV. Our best model uses an XGBClassifier:
1987_results
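
A recall-oriented grid search of the kind described above can be sketched with scikit-learn alone. To keep the example dependency-free, it swaps in a `RandomForestClassifier` for XGBoost and uses `class_weight="balanced"` in place of Tomek Links resampling. The data are synthetic and the parameter grid is arbitrary, so none of this reflects the project's actual settings or results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with mild class imbalance (not the survey data).
X, y = make_classification(
    n_samples=400, n_features=9, n_informative=5,
    weights=[0.4, 0.6], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" is a simple substitute for resampling here.
search = GridSearchCV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
    scoring="recall",  # pick the hyperparameters with the best recall
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"cross-validated recall: {search.best_score_:.2f}")
```

Setting `scoring="recall"` makes the search favor models that miss fewer positive cases, which matches the stated goal of optimizing for recall.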

Running against the cleaned and prepared 2017 data, we were able to obtain acceptable results: 2017_results
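
Scoring a model fitted on one survey year against a later year can be sketched as below. The `evaluate` helper is hypothetical, the two "years" are just halves of a synthetic dataset, and `LogisticRegression` stands in for the tuned classifier. The sketch assumes the later data has already been prepared with the same feature columns as the training data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score


def evaluate(model, X, y):
    """Return recall, F1, and ROC-AUC for a fitted binary classifier."""
    predictions = model.predict(X)
    probabilities = model.predict_proba(X)[:, 1]  # P(class == 1)
    return {
        "recall": recall_score(y, predictions),
        "f1": f1_score(y, predictions),
        "roc_auc": roc_auc_score(y, probabilities),
    }


# Synthetic stand-ins: first half plays the earlier survey, second half the later one.
X, y = make_classification(n_samples=600, n_features=9, random_state=1)
X_early, y_early = X[:300], y[:300]
X_late, y_late = X[300:], y[300:]

# Fit once on the earlier data, then score on the later data without refitting.
model = LogisticRegression(max_iter=1000).fit(X_early, y_early)
scores_late = evaluate(model, X_late, y_late)

print(scores_late)
```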

Conclusion

Our best model performed well, with a recall of .79, F1 score of .79, and ROC-AUC of .78.
Testing our trained classifier on 2017 data yielded promising results: recall of .65, F1 score of .65, and ROC-AUC of .69.
Our analysis also suggests that a woman's education and media exposure are strong indicators of how likely she is to use contraception.
It would be beneficial to direct public health initiatives around these factors.

Further Actions

After submitting an application, we have received authorization from the Demographic and Health Surveys program (DHS) to access the original NICPS full datasets, which include more attributes and much larger sample sizes.

The original datasets will allow us to find deeper insights, as well as test our model on the population of Indonesia in subsequent years.

We would like to explore which other targets can be predicted in order to make further public health policy suggestions.

We would like to further optimize our model for a higher average recall, and continue to clean, implement features from, and predict on data from subsequent years, as well as evaluate against populations of other countries.

Repository Structure

.
├── code/                                     # Python helper functions
│   └── functions.py                          # helper functions
├── crisp_dm_process/                         # initial EDA and model notebook files
│   ├── Initial_EDA.ipynb                     # notebook file with data exploration, insights, and takeaways
│   ├── model_fitting_and_tuning.ipynb        # notebook file for modeling trials and process
│   └── model_predict_1987_vs_2017.ipynb      # notebook file for extended analysis on fitting our trained models on 2017 data
├── data/                                     # project datasets
├── images/                                   # visualizations; images for notebooks, README, and presentation slides
├── Contraception_Indonesia_FINAL.ipynb       # primary project notebook
├── Contraception_Indonesia_Presentation.pdf  # presentation slides
├── README.md                                 # this readme
└── relevant_resources/                       # additional resources for 1987 data

Bibliography

  1. Dataset Origin: the 1987 National Indonesia Contraceptive Prevalence Survey
    
  2. Creator: Tjen-Sien Lim (limt@stat.wisc.edu)

  3. Donor: Tjen-Sien Lim (limt@stat.wisc.edu)

  4. Date: June 7, 1997

  5. Web Source: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice

Authors
Alexander Xin & Mike Flanagan

