- The dataset is loaded using pandas read_csv().
- df.isnull().sum() lists the number of null values in each feature/column.
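A minimal sketch of these two steps; the file name training_set_features.csv is an assumption for illustration:

```python
import pandas as pd

# Load the training data (file name assumed for illustration)
df = pd.read_csv('training_set_features.csv')

# Number of null values per feature/column
print(df.isnull().sum().sort_values(ascending=False))
```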
- Out of the 36 features given, 30 have missing values.
- The 36 features also span different data types (int/float/object (string)).
- In most of the binary features, NULL values are replaced by the MODE, since one of the two classes is dominant.
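A sketch of the mode replacement; here the binary columns are inferred from the number of distinct values, though the notebook may list them explicitly:

```python
# Fill NULLs in the binary (two-valued) features with the column MODE,
# since one class dominates in most of them.
binary_cols = [c for c in df.columns if df[c].dropna().nunique() == 2]
for col in binary_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
```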
- The remaining categorical features are encoded using LabelEncoder from scikit-learn.
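A sketch of the encoding step; casting to str first folds any remaining NaNs into their own category, since LabelEncoder cannot handle missing values directly:

```python
from sklearn.preprocessing import LabelEncoder

# Integer-encode every remaining object (string) column
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```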
- Each feature is visualized with sns.countplot().
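A sketch for one feature; the column name h1n1_concern is an assumption, and the call would be repeated (or looped) for each feature:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of a single feature ('h1n1_concern' is an assumed column name)
sns.countplot(x='h1n1_concern', data=df)
plt.show()
```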
- A heatmap is plotted at the end to view the correlation between the different features.
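A sketch of the correlation heatmap, run after the features have been made numeric:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlation between all features
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), cmap='coolwarm')
plt.show()
```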
- Before model selection, train_test_split() divides the data into training and testing sets with shuffle=True and stratify set to the label, which preserves the class ratio across the split.
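A sketch of the split; the label column names and the 80/20 test size are assumptions:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=['xyz_vaccine', 'seasonal_vaccine'])  # assumed label columns
y = df['xyz_vaccine']

# stratify=y keeps the class ratio identical in the train and test sets,
# which matters because this label is imbalanced.
X_train_xyz, X_test_xyz, y_train_xyz, y_test_xyz = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)
```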
- A total of 5 models have been used (set up as sketched after this list):
- i) Logistic Regression
- ii) Bernoulli Naive Bayes
- iii) SVM with RBF kernel
- iv) Random Forest
- v) XGBoost
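A sketch of how the five candidates might be set up; only the Logistic Regression solver appears in the original notes, so the other hyperparameters are illustrative defaults:

```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# The five candidate models, each wrapped in a pipeline
models = {
    'logistic regression': make_pipeline(LogisticRegression(solver='sag')),
    'bernoulli naive bayes': make_pipeline(BernoulliNB()),
    'svm (rbf kernel)': make_pipeline(SVC(kernel='rbf', probability=True)),
    'random forest': make_pipeline(RandomForestClassifier(random_state=42)),
    'xgboost': make_pipeline(XGBClassifier(random_state=42)),
}
```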
- Models are first trained on only one label, xyz_vaccine/h1n1_vaccine,
- since the dataset is highly imbalanced for xyz_vaccine/h1n1_vaccine compared to seasonal_vaccine, as shown below.
- A model that performs well on this label can then be reused to predict the other label (seasonal_vaccine), as sketched after the model comparison.
- Generic code for any model, implemented using a scikit-learn pipeline:

```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import plot_roc_curve  # removed in scikit-learn 1.2; use RocCurveDisplay there
import matplotlib.pyplot as plt

model_1 = make_pipeline(LogisticRegression(solver='sag'))
model_1.fit(X_train_xyz, y_train_xyz)
plot_roc_curve(model_1, X_test_xyz, y_test_xyz, name='logistic regression - xyz vaccine')
plt.show()
```
- By plotting the ROC curves and comparing the roc_auc_scores, the tree-based models outperformed the remaining models on this classification task.
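A sketch of the comparison, reusing the models dict and the stratified split from the earlier sketches:

```python
from sklearn.metrics import roc_auc_score

# Fit each candidate and score it on the held-out test set
for name, model in models.items():
    model.fit(X_train_xyz, y_train_xyz)
    proba = model.predict_proba(X_test_xyz)[:, 1]
    print(f'{name}: roc_auc = {roc_auc_score(y_test_xyz, proba):.2f}')
```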
- Random Forest and XGBoost gave roc_auc_scores of 0.83 and 0.82 respectively.
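A sketch of reusing the best-performing family (Random Forest) on the second label, assuming the same split settings as before:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Stratified split on the seasonal_vaccine label
y_seas = df['seasonal_vaccine']
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y_seas, test_size=0.2, shuffle=True, stratify=y_seas, random_state=42)

# Reuse the model family that scored best on xyz_vaccine
model_seasonal = make_pipeline(RandomForestClassifier(random_state=42))
model_seasonal.fit(X_train_s, y_train_s)
print(roc_auc_score(y_test_s, model_seasonal.predict_proba(X_test_s)[:, 1]))
```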
- Tree-based models are the best fit for this classification task. Even though we attained a roc_auc_score of 0.83, it could be improved considerably by a more detailed preprocessing step that uses KNNImputer and IterativeImputer instead of simple mode replacement (this may be computationally intensive but gives better results).
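A sketch of the suggested alternatives, applied to the feature matrix before model training; note that IterativeImputer still requires the experimental enable import:

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

# Distance-based: fill each NULL from the 5 most similar rows
X_knn = KNNImputer(n_neighbors=5).fit_transform(X)

# Model-based: iteratively regress each feature on the others
X_iter = IterativeImputer(max_iter=10, random_state=42).fit_transform(X)
```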