GeneticOversampler is a Python library that addresses class imbalance in datasets through oversampling. It combines a genetic algorithm, density-peak clustering, a hybrid distance metric, and machine learning models to generate synthetic minority-class samples, with the goal of improving model performance on imbalanced data.
- Genetic Algorithm for Oversampling:
- Implements evolutionary techniques to generate synthetic minority class samples.
- Includes mutation, crossover, and fitness evaluation processes.
- Clustering-Based Sampling:
- Uses the CFSFDP (Clustering by Fast Search and Find of Density Peaks) algorithm to partition minority samples into inland, borderline, and trapped regions, which then guide oversampling.
- Hybrid Distance Metrics:
- Introduces HEEM (Hybrid Entropy-Enhanced Metric) for improved distance calculations across mixed data types.
- KNN and MICE Imputation:
- Fills in missing values with KNN or MICE imputation before oversampling.
- Evaluation:
- Provides tools for model evaluation using metrics such as ROC-AUC, precision, recall, and F1-score.
The core of the project is a genetic algorithm that generates synthetic samples based on the minority class distribution.
- Initialization:
- A population of synthetic samples is generated from minority class neighbors.
- Fitness Evaluation:
- Combines machine learning model confidence with domain-based penalties.
- Example formula for fitness evaluation:
F_fitness = max(α * P_model + β * (1 - sigmoid(L)), 0)
Where:
- P_model: Prediction confidence of the ML model.
- L: Domain-based loss.
- α, β: Weighting coefficients.
- Crossover and Mutation:
- Produces new samples by combining features from parent samples with random mutations for diversity.
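The fitness formula and the crossover and mutation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's implementation: the function names, the default values of `alpha` and `beta`, the use of uniform crossover, and the Gaussian mutation with `rate` and `scale` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fitness(p_model, loss, alpha=0.7, beta=0.3):
    """F_fitness = max(alpha * P_model + beta * (1 - sigmoid(L)), 0)."""
    return np.maximum(alpha * p_model + beta * (1.0 - sigmoid(loss)), 0.0)

def crossover(parent_a, parent_b, rng):
    """Uniform crossover (assumed): each feature comes from one parent at random."""
    mask = rng.random(parent_a.shape) < 0.5
    return np.where(mask, parent_a, parent_b)

def mutate(child, rng, rate=0.1, scale=0.05):
    """Gaussian mutation (assumed): perturb a random subset of features."""
    mask = rng.random(child.shape) < rate
    return child + mask * rng.normal(0.0, scale, size=child.shape)

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

child = mutate(crossover(a, b, rng), rng)

# With P_model = 0.9 and L = 0: 0.7 * 0.9 + 0.3 * (1 - 0.5) = 0.78
score = fitness(p_model=0.9, loss=0.0)
```

A larger loss `L` pushes `sigmoid(L)` toward 1, so the second term shrinks and the candidate's fitness drops toward `alpha * P_model`.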
- Inland/Borderline Loss:
L = 0.5 * [MaxDist(I, neighbors)^2 + max(0, Margin - MaxDist(I, M))^2]
- Trapped Loss:
L = 0.5 * (cosine similarity penalty + distance-based penalty)
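As a rough sketch, the inland/borderline loss can be written as follows. The formula does not define its symbols, so this example assumes that `I` is the candidate synthetic sample, `neighbors` are its minority-class neighbors, `M` is a set of majority-class samples, and `MaxDist` is the largest Euclidean distance from the candidate to a set of points. The function names and the default `margin` are hypothetical.

```python
import numpy as np

def max_dist(point, others):
    """Largest Euclidean distance from `point` to any row of `others`."""
    return float(np.max(np.linalg.norm(others - point, axis=1)))

def inland_borderline_loss(candidate, minority_neighbors, majority, margin=1.0):
    # First term: penalize candidates that lie far from their minority neighbors.
    pull = max_dist(candidate, minority_neighbors) ** 2
    # Second term: hinge penalty when MaxDist(I, M) is smaller than the margin.
    push = max(0.0, margin - max_dist(candidate, majority)) ** 2
    return 0.5 * (pull + push)

candidate = np.array([0.0, 0.0])
minority_neighbors = np.array([[0.3, 0.4], [0.0, 0.1]])
majority = np.array([[0.6, 0.0], [0.0, 0.2]])

# pull = 0.5^2 = 0.25, push = (1.0 - 0.6)^2 = 0.16, loss = 0.5 * 0.41 = 0.205
loss = inland_borderline_loss(candidate, minority_neighbors, majority, margin=1.0)
```

The trapped loss is described only in terms of a cosine similarity penalty and a distance-based penalty, so it is not sketched here.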
CFSFDP clustering identifies minority class samples as:
- Inland: High-density regions with fewer majority class neighbors.
- Borderline: Samples close to majority class boundaries.
- Trapped: Minority samples surrounded by majority samples.
Density(i) = Σ_{j ≠ i} exp(-d_ij^2 / d_c^2)
Where:
- d_ij: Distance between samples i and j.
- d_c: Cutoff distance (quantile-based).
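The density formula above can be illustrated with a short NumPy sketch. It assumes Euclidean distances and takes `d_c` as a quantile of the pairwise distances; the function name `local_density` and the `quantile` parameter are hypothetical, and the library may choose the cutoff differently.

```python
import numpy as np

def local_density(X, quantile=0.02):
    """Gaussian-kernel density: sum over j != i of exp(-d_ij^2 / d_c^2)."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))          # pairwise distance matrix
    upper = np.triu_indices(len(X), k=1)           # each pair counted once
    d_c = np.quantile(d[upper], quantile)          # quantile-based cutoff
    kernel = np.exp(-(d ** 2) / (d_c ** 2))
    np.fill_diagonal(kernel, 0.0)                  # exclude the j == i term
    return kernel.sum(axis=1), d_c

# Three points close together and one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
density, d_c = local_density(X, quantile=0.5)
```

In this toy example the isolated point at `(5, 5)` receives the lowest density. Note that the sketch does not guard against `d_c == 0`, which can happen when the data contains many duplicate rows.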
Importance = σ_w * Normalized(σ) + (1 - σ_w) * Normalized(Density)
HEEM handles mixed data types (categorical and continuous) by computing weighted distances.
Distance(i, j) = √(Σ_k(|x_ik - x_jk| / (4 * std(k)))^2 + Σ_l(w_l * indicator(x_il ≠ x_jl)))
Where:
- w_l: Entropy-derived weight for categorical feature l.
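A minimal sketch of this distance for a single pair of samples is shown below. It reads the formula as squaring each scaled continuous difference inside the sum, which is an interpretation. How `w_l` is derived from entropy is not specified here, so the weights are simply passed in; the function name `heem_distance` and its argument layout are hypothetical.

```python
import numpy as np

def heem_distance(xi_num, xj_num, std, xi_cat, xj_cat, weights):
    """Hybrid distance over continuous (num) and categorical (cat) features."""
    # Continuous part: absolute differences scaled by 4 * std, then squared.
    num_term = np.sum((np.abs(xi_num - xj_num) / (4.0 * std)) ** 2)
    # Categorical part: weighted mismatch indicator per feature.
    cat_term = np.sum(weights * (xi_cat != xj_cat))
    return float(np.sqrt(num_term + cat_term))

xi_num = np.array([1.0, 10.0])
xj_num = np.array([3.0, 10.0])
std = np.array([0.5, 2.0])             # per-feature standard deviations
xi_cat = np.array(["red", "A"])
xj_cat = np.array(["blue", "A"])
weights = np.array([0.75, 0.4])        # illustrative values, not real entropy weights

# Continuous term: (2 / 2)^2 + 0 = 1; categorical term: 0.75; d = sqrt(1.75)
d = heem_distance(xi_num, xj_num, std, xi_cat, xj_cat, weights)
```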
To handle missing data:
- KNN Imputation: Finds nearest neighbors and imputes missing values based on similarity.
- MICE (Multiple Imputation by Chained Equations):
- Iteratively predicts missing values using Random Forest regressors.
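Both imputation strategies can be approximated with scikit-learn. The sketch below uses `KNNImputer` and `IterativeImputer` with a `RandomForestRegressor` estimator; whether GeneticOversampler relies on these classes internally is an assumption, and the toy matrix and parameter values are chosen only for illustration. `IterativeImputer` is still marked experimental in scikit-learn, hence the explicit enabling import.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Toy data with two missing entries.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, np.nan],
    [5.0, 10.0, 15.0],
])

# KNN imputation: each missing value is filled from the 2 nearest rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# MICE-style imputation: each feature with missing values is modeled
# from the others, round-robin, using a random forest regressor.
mice = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_mice = mice.fit_transform(X)
```

Both imputers leave observed values untouched and only fill the `NaN` positions. `IterativeImputer` may emit a convergence warning on data this small.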
- Data Preprocessing:
- Normalize features.
- Handle missing values using KNN or MICE.
- Clustering:
- Apply CFSFDP to categorize minority samples into inland, borderline, and trapped regions.
- Synthetic Sample Generation:
- Use the genetic algorithm to generate synthetic samples based on clustering results.
- Model Training:
- Train ML models on the balanced dataset.
- Evaluation:
- Measure model performance using metrics such as ROC-AUC, precision, recall, and F1-score.
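The training and evaluation steps can be sketched with scikit-learn as follows. Because the GeneticOversampler API is not shown here, the oversampling step is replaced by plain random duplication of minority rows, which is only a stand-in and not the genetic algorithm described above. The dataset, classifier, and parameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 90% majority, 10% minority.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Stand-in for the oversampler: duplicate minority rows until classes balance.
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y_train == 1)
n_extra = int((y_train == 0).sum() - minority_idx.size)
extra = rng.choice(minority_idx, size=n_extra, replace=True)
X_bal = np.vstack([X_train, X_train[extra]])
y_bal = np.concatenate([y_train, y_train[extra]])

# Train on the balanced set, evaluate on the untouched test set.
model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
proba = model.predict_proba(X_test)[:, 1]
pred = model.predict(X_test)

metrics = {
    "roc_auc": roc_auc_score(y_test, proba),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
}
```

Only the training split is oversampled. The test split keeps its original class ratio so that the metrics reflect performance on realistically imbalanced data.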