Classification in Unbalanced Data Problems

Importance of Investigating Unbalanced Data

Unbalanced data is worth investigating because class imbalance can cause several problems in classification models:

Poor Optimization

When data is unbalanced, the decision boundary can be rotated without significantly changing the value of the objective function. The objective function is then nearly flat around the optimal solution, which makes the true optimal parameters difficult to locate.

Hessian Matrix Issues

The Hessian matrix, which describes the curvature of the objective function, can have small eigenvalues, indicating a surface that is flat in some directions. The objective function is then insensitive to changes in the parameter values along those directions, which leads to poor convergence during optimization.
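
As a rough illustration of this point, the sketch below compares the Hessian eigenvalues for a balanced and an unbalanced sample. It is not taken from the repository's script: the helper `hessian_eigen`, the class means, and the class sizes are all assumptions chosen for the example. It relies on the standard result that the Hessian of the logistic regression negative log-likelihood is X'WX with W = diag(p(1 - p)).

```r
# Hypothetical helper (not from the repository): simulate two Gaussian
# classes, fit a logistic regression, and return the eigenvalues of the
# Hessian of the negative log-likelihood at the fitted parameters.
hessian_eigen <- function(n_major, n_minor, seed = 1) {
  set.seed(seed)
  # Assumed setup: class 0 centred at (-1, -1), class 1 at (1, 1)
  x <- rbind(
    matrix(rnorm(2 * n_major, mean = -1), ncol = 2),
    matrix(rnorm(2 * n_minor, mean = 1), ncol = 2)
  )
  y <- c(rep(0, n_major), rep(1, n_minor))
  fit <- glm(y ~ x, family = binomial)
  X <- cbind(1, x)                    # design matrix with intercept column
  p <- fitted(fit)                    # fitted probabilities
  H <- t(X) %*% (X * (p * (1 - p)))   # H = X' W X, W = diag(p * (1 - p))
  eigen(H, symmetric = TRUE)$values   # sorted in decreasing order
}

# Same total sample size, different class balance
ev_balanced   <- hessian_eigen(n_major = 500, n_minor = 500)
ev_unbalanced <- hessian_eigen(n_major = 990, n_minor = 10)
print(rbind(balanced = ev_balanced, unbalanced = ev_unbalanced))
```

Under these assumptions the unbalanced sample should show smaller eigenvalues, because most fitted probabilities sit close to 0 and contribute little to the weights p(1 - p).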

Model Performance

Unbalanced data can bias the model towards the majority class, reducing its ability to predict the minority class accurately. Overall accuracy can then look high even though predictions for the minority class are unreliable.
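
The majority-class bias can be checked by computing recall separately for each class. The following is a minimal sketch on simulated data, not the repository's script; the class sizes, class means, and the 0.5 threshold are assumptions made for illustration.

```r
set.seed(42)

# Assumed class sizes and means, chosen so that the classes overlap
n_major <- 950
n_minor <- 50
x <- rbind(
  matrix(rnorm(2 * n_major, mean = -0.5), ncol = 2),
  matrix(rnorm(2 * n_minor, mean = 0.5), ncol = 2)
)
dat <- data.frame(
  x1 = x[, 1],
  x2 = x[, 2],
  y  = c(rep(0, n_major), rep(1, n_minor))
)

fit <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Classify with the conventional 0.5 probability threshold
pred <- as.integer(fitted(fit) > 0.5)

accuracy     <- mean(pred == dat$y)
recall_major <- mean(pred[dat$y == 0] == 0)  # share of class 0 recovered
recall_minor <- mean(pred[dat$y == 1] == 1)  # share of class 1 recovered

print(c(accuracy = accuracy,
        recall_major = recall_major,
        recall_minor = recall_minor))
```

Comparing `recall_minor` with `accuracy` shows how a single overall score can hide weak performance on the minority class.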

By generating and analyzing both balanced and unbalanced datasets, this script helps illustrate these issues, providing a visual and mathematical understanding of how unbalanced data affects logistic regression models.

Installation

The main R script requires the MASS, ggplot2, and plotly packages, which you can install with the following command:

install.packages(c("MASS", "ggplot2", "plotly"))

Key Steps of the R Script

Load Required Packages:
- Load necessary libraries for data generation, plotting, and interactive visualization.
Set Seed for Reproducibility:
- Set the seed to ensure that the random data generation is consistent across runs.
Define Number of Points in Each Class:
- Specify the number of data points for the majority and minority classes, allowing for both balanced and unbalanced datasets.
Define True Logistic Regression Parameters:
- Set the true coefficients for the logistic regression model to create a known decision boundary.
Generate Data:
- Generate multivariate normal data for both classes, including both feature values and class labels.
Fit Logistic Regression Model:
- Train a logistic regression model using the combined dataset to estimate the decision boundary.
Create a Grid for Plotting the Decision Boundary:
- Generate a grid of values to visualize the estimated decision boundary of the logistic regression model.
Plot the Data and Decision Boundary:
- Use ggplot2 to plot the data points along with the true and estimated decision boundaries, providing a visual comparison.
Plot Log-Likelihood Surface:
- Plot the log-likelihood surface using plotly to visualize how the model's objective function behaves with different parameter values.
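
The steps above can be sketched roughly as follows. This is an illustrative outline, not the repository's script: the class sizes, means, covariance, and variable names are assumptions, and the "true" coefficients are derived from the assumed Gaussian classes (with a shared covariance the log-odds are linear in the features), which may differ from how the script defines them. For brevity the boundary is drawn as a line rather than from a grid, and the plotly log-likelihood surface is omitted.

```r
library(MASS)  # for mvrnorm()

set.seed(123)

# Class sizes (assumed values; set n_minor to 500 for a balanced dataset)
n_major <- 500
n_minor <- 25

# Assumed class means and shared covariance matrix
mu0 <- c(-1, -1)
mu1 <- c(1, 1)
Sigma <- diag(2)

# True coefficients implied by the Gaussian setup, including the
# log prior ratio in the intercept
Sigma_inv <- solve(Sigma)
beta_true <- as.vector(Sigma_inv %*% (mu1 - mu0))
beta0_true <- as.numeric(
  -0.5 * (t(mu1) %*% Sigma_inv %*% mu1 - t(mu0) %*% Sigma_inv %*% mu0)
) + log(n_minor / n_major)

# Generate features and class labels
x <- rbind(mvrnorm(n_major, mu0, Sigma), mvrnorm(n_minor, mu1, Sigma))
dat <- data.frame(x1 = x[, 1], x2 = x[, 2],
                  y = rep(c(0, 1), c(n_major, n_minor)))

# Fit the logistic regression model
fit <- glm(y ~ x1 + x2, data = dat, family = binomial)
b <- unname(coef(fit))

# Boundary b0 + b1 * x1 + b2 * x2 = 0, written as x2 = intercept + slope * x1
boundary <- data.frame(
  type      = c("true", "estimated"),
  intercept = c(-beta0_true / beta_true[2], -b[1] / b[3]),
  slope     = c(-beta_true[1] / beta_true[2], -b[2] / b[3])
)

# Plot the data with both boundaries if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dat, aes(x1, x2, colour = factor(y))) +
    geom_point(alpha = 0.6) +
    geom_abline(data = boundary,
                aes(intercept = intercept, slope = slope, linetype = type)) +
    labs(colour = "class", linetype = "boundary")
  print(p)
}
```

Rerunning the sketch with different values of `n_minor` gives a quick way to see how the estimated boundary moves relative to the true one as the imbalance grows.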

Repository: ccb-hms/unbalanced-data-example

Folders and files

Latest commit

History

Repository files navigation

Classification in Unbalanced Data Problems

Importance of Investigating Unbalanced Data

Poor Optimization

Hessian Matrix Issues

Model Performance

Installation

Key Steps of the R Script

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages