write final report
reeteshsudhakar committed Dec 5, 2023
1 parent bd24d9a commit a625f8c
Showing 9 changed files with 2,345 additions and 83 deletions.
91 changes: 60 additions & 31 deletions final.md

Large diffs are not rendered by default.

13 changes: 0 additions & 13 deletions midterm.md
@@ -16,12 +16,6 @@ Our initial dataset for training the binary classification model, aimed at predi

1. **Data Cleaning and Feature Selection**: We began by identifying features with substantial missing data, specifically those with over 50% missing values. This criterion led to the elimination of several columns, consisting primarily of incomplete data or binary flags that were impractical to impute meaningfully. This decision aligns with practices recommended in [Bao et al. (2019)](https://doi.org/10.1016/j.eswa.2019.02.033) and [de Castro Vieira et al. (2019)](https://doi.org/10.1016/j.asoc.2019.105640), both of whom recommend against retaining features with substantial missing data, since imputing large gaps can distort the data and degrade accuracy. Bao et al. specifically stated that their cleaning approach started with: "for the features not filled out by 95% of applicants or above, we removed this feature." These papers, along with other sources from our initial literature review and background research, underscore the importance of data quality and relevance in predictive accuracy. The helper used for the drop is shown below, followed by a sketch of how the ignored columns could have been identified.

```python
import pandas as pd

def defaultClean(df: pd.DataFrame) -> None:
    # Drop, in place, the columns deemed irrelevant during feature selection.
    # ignore_columns is a list of column names defined elsewhere in the module.
    df.drop(ignore_columns, axis=1, inplace=True)
```
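The report does not show how `ignore_columns` was assembled; a minimal sketch of the 50%-missing-value criterion described above, with the threshold and variable names as our own assumptions, might look like this:

```python
# Hypothetical reconstruction: flag every column in which more than half
# of the values are missing, then hand the resulting list to defaultClean.
missing_fraction = df.isna().mean()
ignore_columns = missing_fraction[missing_fraction > 0.5].index.tolist()
```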

2. **Validation of Feature Removal**: To validate our feature removal decisions, we used a Decision Tree Classifier, an approach supported by [Emad Azhar Ali et al. (2021)](https://doi.org/10.24867/ijiem-2021-1-272). This confirmed that the eliminated features had minimal impact on predicting loan defaults, ensuring the quality and relevance of the retained data, while also surfacing features with significant predictive power for whether an individual defaults on a home loan. A minimal sketch of this check appears below.
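The validation snippet from the original report is collapsed in this diff; the following is only a sketch of the idea under standard scikit-learn usage, with `X_train` and `y_train` as assumed names for the training split:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sketch: fit a shallow tree and rank feature importances to
# confirm that the dropped features contribute little to the prediction.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
ranked = sorted(zip(X_train.columns, tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.4f}")
```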

@@ -43,13 +37,6 @@ Our initial dataset for training the binary classification model, aimed at predi

3. **Handling Categorical Data**: We then addressed the challenge of categorical data, converting string and object columns into discrete numerical codes. This conversion is crucial for compatibility with machine learning algorithms, as noted in [Krainer and Laderman (2013)](https://doi.org/10.1007/s10693-013-0161-7). We used `pandas` for this step; the snippet below shows the conversion:

```python
# Convert string/object columns to pandas categoricals, then replace each
# category with its integer code. The codes are shifted by 1 so that missing
# values, which pandas encodes as -1, become 0.
str_columns = df.select_dtypes(['string', "object"]).columns
df[str_columns] = df[str_columns].astype("category")
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes + 1)
```

4. **Imputation of Missing Values**: For the remaining features, our strategy for handling missing values varied with the feature's context. For instance, null values in 'OWN_CAR_AGE' were interpreted as the absence of a car and replaced with zeros. This kind of context-sensitive imputation is supported by [Bao et al. (2019)](https://doi.org/10.1016/j.eswa.2019.02.033), which emphasizes the importance of maintaining data integrity. A one-line sketch of this rule appears below.
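The exact call is not shown in the report; a minimal sketch consistent with the description, assuming the `df` frame from the earlier snippets:

```python
# Assumption: a missing OWN_CAR_AGE means the applicant owns no car,
# so zero is a semantically meaningful fill value.
df['OWN_CAR_AGE'] = df['OWN_CAR_AGE'].fillna(0)
```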

5. **Advanced Imputation Techniques**: For features where zero imputation was inappropriate, we initially applied `sklearn`'s SimpleImputer with -1 as a placeholder. Recognizing the limitations of this naive approach, we also trialed a more sophisticated method, the K-Nearest Neighbors (KNN) imputer, but its runtime proved prohibitive, eliminating it from consideration for this milestone. That said, its potential for more accurate imputation, as suggested by [de Castro Vieira et al. (2019)](https://doi.org/10.1016/j.asoc.2019.105640), warrants revisiting in the final portion of the project. A sketch of both imputers follows.
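The report names the imputers but not the calls; a minimal sketch under standard scikit-learn usage, again assuming the `df` frame from the earlier snippets:

```python
from sklearn.impute import SimpleImputer, KNNImputer

# Placeholder imputation as described above: fill remaining gaps with -1.
imputer = SimpleImputer(strategy="constant", fill_value=-1)
df[df.columns] = imputer.fit_transform(df)

# The KNN alternative that was set aside for runtime reasons:
# knn_imputer = KNNImputer(n_neighbors=5)
# df[df.columns] = knn_imputer.fit_transform(df)
```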
Binary file added resources/final/lr-confusion-matrix.png
Binary file added resources/final/rf-confusion-matrix.png
Binary file added resources/final/svc-confusion-matrix.png
Binary file added rf-confusion-matrix.png
23 changes: 19 additions & 4 deletions src/classifier.py
```diff
@@ -3,9 +3,9 @@
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import MinMaxScaler
 from sklearn.linear_model import SGDClassifier, LogisticRegression
-from sklearn.ensemble import RandomForestClassifier
+from sklearn.ensemble import RandomForestClassifier, VotingClassifier
 from sklearn.model_selection import RandomizedSearchCV
-from sklearn.metrics import balanced_accuracy_score, confusion_matrix, ConfusionMatrixDisplay
+from sklearn.metrics import balanced_accuracy_score, confusion_matrix, ConfusionMatrixDisplay, f1_score, precision_score, roc_auc_score


 def fit_svm_classifier(X, y):
@@ -23,16 +23,27 @@ def fit_random_forest_classifier(X, y):
     pipeline.fit(X, y)
     return pipeline

+def fit_voting_classifier(X, y):
+    svm_pipeline = make_pipeline(MinMaxScaler(), SGDClassifier(loss="log_loss", random_state=0, class_weight="balanced", max_iter=10000))
+    lr_pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(C=2, random_state=0, class_weight="balanced", max_iter=10000))
+    rf_pipeline = make_pipeline(MinMaxScaler(), RandomForestClassifier(max_depth=10, random_state=0, class_weight="balanced"))
+    pipeline = VotingClassifier(estimators=[('svm', svm_pipeline), ('lr', lr_pipeline), ('rf', rf_pipeline)], voting='soft')
+    pipeline.fit(X, y)
+    return pipeline
+
+
 classifier_functions = {
     "svm": fit_svm_classifier,
     "lr": fit_lr_classifier,
-    "rf": fit_random_forest_classifier
+    "rf": fit_random_forest_classifier,
+    "voting": fit_voting_classifier
 }

 classifier_names = {
     "svm": "Support Vector Machine",
     "lr": "Logistic Regression",
-    "rf": "Random Forest"
+    "rf": "Random Forest",
+    "voting": "Voting"
 }

 def run_and_compare(train_X, train_y, test_x, test_y, model: str):
@@ -42,8 +53,12 @@ def run_and_compare(train_X, train_y, test_x, test_y, model: str):

     fit_model = classifier_functions[model](train_X, train_y)
     fit_model_balanced_accuracy = balanced_accuracy_score(test_y, fit_model.predict(test_x))
+    fit_model_f1_score = f1_score(test_y, fit_model.predict(test_x), average="weighted")
+    fit_model_precision_score = precision_score(test_y, fit_model.predict(test_x), average="weighted")

     print(f"{model} balanced accuracy: {fit_model_balanced_accuracy}")
+    print(f"{model} f1 score: {fit_model_f1_score}")
+    print(f"{model} precision score: {fit_model_precision_score}")
     plot_confusion_matrix(test_y, fit_model.predict(test_x), title=f"{model} Confusion Matrix")

 def tune_hyperparameters(X, y, parameters, model):
```
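For context, the new soft-voting ensemble is exercised through the existing `run_and_compare` entry point; a hypothetical invocation, assuming pre-split training and test data named as in the function signature:

```python
# Hypothetical usage of the updated module; train_X, train_y, test_x, test_y
# are assumed to come from an earlier train/test split.
run_and_compare(train_X, train_y, test_x, test_y, model="voting")
```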
