write final report
reeteshsudhakar committed Dec 5, 2023
1 parent bd24d9a commit a625f8c
Showing 9 changed files with 2,345 additions and 83 deletions.
91 changes: 60 additions & 31 deletions final.md

Large diffs are not rendered by default.

13 changes: 0 additions & 13 deletions midterm.md
@@ -16,12 +16,6 @@ Our initial dataset for training the binary classification model, aimed at predi

1. **Data Cleaning and Feature Selection**: We began by identifying features with substantial missing data, specifically those with over 50% missing values. This criterion led to the elimination of several columns, consisting primarily of incomplete data or binary flags that were impractical to impute meaningfully. This decision aligns with practices recommended in [Bao et al. (2019)](https://doi.org/10.1016/j.eswa.2019.02.033) and [de Castro Vieira et al. (2019)](https://doi.org/10.1016/j.asoc.2019.105640), both of whom recommend against retaining features with substantial missing data, since imputing large gaps can distort the data and degrade accuracy. Bao et al. specifically stated that their cleaning approach started with: "for the features not filled out by 95% of applicants or above, we removed this feature." These papers, along with other sources from our initial literature review and background research, underscore the importance of data quality and relevance in predictive accuracy. The helper used for the drop is shown below, followed by a sketch of how the ignored columns could have been identified.

```python
import pandas as pd

def defaultClean(df: pd.DataFrame) -> None:
    # Drop, in place, the columns deemed irrelevant during feature selection.
    # ignore_columns is a list of column names defined elsewhere in the module.
    df.drop(ignore_columns, axis=1, inplace=True)
```
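The report does not show how `ignore_columns` was assembled; a minimal sketch of the 50%-missing-value criterion described above, with the threshold and variable names as our own assumptions, might look like this:

```python
# Hypothetical reconstruction: flag every column in which more than half
# of the values are missing, then hand the resulting list to defaultClean.
missing_fraction = df.isna().mean()
ignore_columns = missing_fraction[missing_fraction > 0.5].index.tolist()
```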

2. **Validation of Feature Removal**: To validate our feature removal decisions, we used a Decision Tree Classifier, an approach supported by [Emad Azhar Ali et al. (2021)](https://doi.org/10.24867/ijiem-2021-1-272). This confirmed that the eliminated features had minimal impact on predicting loan defaults, ensuring the quality and relevance of the retained data, while also surfacing features with significant predictive power for whether an individual defaults on a home loan. A minimal sketch of this check appears below.
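The validation snippet from the original report is collapsed in this diff; the following is only a sketch of the idea under standard scikit-learn usage, with `X_train` and `y_train` as assumed names for the training split:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sketch: fit a shallow tree and rank feature importances to
# confirm that the dropped features contribute little to the prediction.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
ranked = sorted(zip(X_train.columns, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.4f}")
```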

@@ -43,13 +37,6 @@ Our initial dataset for training the binary classification model, aimed at predi

3. **Handling Categorical Data**: We then addressed the challenge of categorical data, converting string and object columns into discrete numerical codes. This conversion is crucial for compatibility with machine learning algorithms, as noted in [Krainer and Laderman (2013)](https://doi.org/10.1007/s10693-013-0161-7). We used `pandas` for this step; the snippet below shows the conversion:

```python
# Convert string/object columns to pandas categoricals, then replace each
# category with its integer code. The codes are shifted by 1 so that missing
# values, which pandas encodes as -1, become 0.
str_columns = df.select_dtypes(['string', "object"]).columns
df[str_columns] = df[str_columns].astype("category")
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes + 1)
```

4. **Imputation of Missing Values**: For the remaining features, our strategy for handling missing values varied with the feature's context. For instance, null values in 'OWN_CAR_AGE' were interpreted as the absence of a car and replaced with zeros. This kind of context-sensitive imputation is supported by [Bao et al. (2019)](https://doi.org/10.1016/j.eswa.2019.02.033), which emphasizes the importance of maintaining data integrity. A one-line sketch of this rule appears below.
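The exact call is not shown in the report; a minimal sketch consistent with the description, assuming the `df` frame from the earlier snippets:

```python
# Assumption: a missing OWN_CAR_AGE means the applicant owns no car,
# so zero is a semantically meaningful fill value.
df['OWN_CAR_AGE'] = df['OWN_CAR_AGE'].fillna(0)
```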

5. **Advanced Imputation Techniques**: For features where zero imputation was inappropriate, we initially applied `sklearn`'s SimpleImputer with -1 as a placeholder. Recognizing the limitations of this naive approach, we also trialed a more sophisticated method, the K-Nearest Neighbors (KNN) imputer, but its runtime proved prohibitive, eliminating it from consideration for this milestone. That said, its potential for more accurate imputation, as suggested by [de Castro Vieira et al. (2019)](https://doi.org/10.1016/j.asoc.2019.105640), warrants revisiting in the final portion of the project. A sketch of both imputers follows.
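The report names the imputers but not the calls; a minimal sketch under standard scikit-learn usage, again assuming the `df` frame from the earlier snippets:

```python
from sklearn.impute import SimpleImputer, KNNImputer

# Placeholder imputation as described above: fill remaining gaps with -1.
imputer = SimpleImputer(strategy="constant", fill_value=-1)
df[df.columns] = imputer.fit_transform(df)

# The KNN alternative that was set aside for runtime reasons:
# knn_imputer = KNNImputer(n_neighbors=5)
# df[df.columns] = knn_imputer.fit_transform(df)
```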
Binary file added resources/final/lr-confusion-matrix.png
Binary file added resources/final/rf-confusion-matrix.png
Binary file added resources/final/svc-confusion-matrix.png
Binary file added rf-confusion-matrix.png
23 changes: 19 additions & 4 deletions src/classifier.py
```diff
@@ -3,9 +3,9 @@
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import MinMaxScaler
 from sklearn.linear_model import SGDClassifier, LogisticRegression
-from sklearn.ensemble import RandomForestClassifier
+from sklearn.ensemble import RandomForestClassifier, VotingClassifier
 from sklearn.model_selection import RandomizedSearchCV
-from sklearn.metrics import balanced_accuracy_score, confusion_matrix, ConfusionMatrixDisplay
+from sklearn.metrics import balanced_accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score, precision_score, roc_auc_score


 def fit_svm_classifier(X, y):
@@ -23,16 +23,27 @@ def fit_random_forest_classifier(X, y):
     pipeline.fit(X, y)
     return pipeline

+def fit_voting_classifier(X, y):
+    svm_pipeline = make_pipeline(MinMaxScaler(), SGDClassifier(loss="log_loss", random_state=0, class_weight="balanced", max_iter=10000))
+    lr_pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(C=2, random_state=0, class_weight="balanced", max_iter=10000))
+    rf_pipeline = make_pipeline(MinMaxScaler(), RandomForestClassifier(max_depth=10, random_state=0, class_weight="balanced"))
+    pipeline = VotingClassifier(estimators=[('svm', svm_pipeline), ('lr', lr_pipeline), ('rf', rf_pipeline)], voting='soft')
+    pipeline.fit(X, y)
+    return pipeline
+
+
 classifier_functions = {
     "svm": fit_svm_classifier,
     "lr": fit_lr_classifier,
-    "rf": fit_random_forest_classifier
+    "rf": fit_random_forest_classifier,
+    "voting": fit_voting_classifier
 }

 classifier_names = {
     "svm": "Support Vector Machine",
     "lr": "Logistic Regression",
-    "rf": "Random Forest"
+    "rf": "Random Forest",
+    "voting": "Voting"
 }

 def run_and_compare(train_X, train_y, test_x, test_y, model: str):
@@ -42,8 +53,12 @@ def run_and_compare(train_X, train_y, test_x, test_y, model: str):

     fit_model = classifier_functions[model](train_X, train_y)
     fit_model_balanced_accuracy = balanced_accuracy_score(test_y, fit_model.predict(test_x))
+    fit_model_f1_score = f1_score(test_y, fit_model.predict(test_x), average="weighted")
+    fit_model_precision_score = precision_score(test_y, fit_model.predict(test_x), average="weighted")

     print(f"{model} balanced accuracy: {fit_model_balanced_accuracy}")
+    print(f"{model} f1 score: {fit_model_f1_score}")
+    print(f"{model} precision score: {fit_model_precision_score}")
     plot_confusion_matrix(test_y, fit_model.predict(test_x), title=f"{model} Confusion Matrix")

 def tune_hyperparameters(X, y, parameters, model):
```
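For context, the new soft-voting ensemble is exercised through the existing `run_and_compare` entry point; a hypothetical invocation, assuming pre-split training and test data named as in the function signature:

```python
# Hypothetical usage of the updated module; train_X, train_y, test_x, test_y
# are assumed to come from an earlier train/test split.
run_and_compare(train_X, train_y, test_x, test_y, model="voting")
```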
