Data decent

Automated Software Engineering (Group 1) research project where we :

Create an intelligent data pruning mechanism
Perform statistical analysis of intelligently choosing 'least' amount of 'best' data for data mining tasks
Our report is available at - report

Abstract

Data quality is a critical metric in ensuring that machine learning models perform well on real-world tasks. More specifically, in the case of class-imbalanced datasets, sampling strategies play an important role in deciding the quality of data fed to machine learning models. Our study introduces a novel approach to data under-sampling called 'intelligent data pruning,' which employs advanced clustering techniques to identify and eliminate redundant or irrelevant data points. By significantly reducing dataset size without loss of 'informative' data samples, our method optimizes the model training process for low computational resource utilization while enhancing model generalization. We extensively compare our method with existing under-sampling and over-sampling techniques on various software engineering datasets using three different learners: Logistic Regression, Decision Tree, and Support Vector Machine. Traditional data sampling methods like random sampling, SMOTE as well as sophisticated techniques such as Gaussian Copula and Recursive Random Projection based oversampling techniques are included in our evaluation as benchmarks. Through rigorous experimentation and performance analysis, we demonstrate the superior accuracy of our proposed data pruning method across diverse datasets and learners. Additionally, we provide insights into the underlying mechanisms driving its effectiveness and its potential applications in real-world scenarios with large-scale datasets and computational constraints.

Datasets

Breast Cancer (classification)
Churn (classification)
JS_Vuln (classification)
Ambari_Vuln (classification)
Defect_Eclipse_JDT_Core (classification)
Defect_Eclipse_PDE_UI (classification)
Moodle_Vuln (classification)
Defect_Mylyn (classification)

Data sampling techniques used/created

Baselines

Technique	Type
Random Pruning	Under-sampling
Random Oversampling	Over-sampling
SMOTE	Over-sampling
SVM SMOTE	Over-sampling
Gaussian Copula-SDV	Over-sampling
Recursive Random Projections-RRP	Over-sampling

Proposed + devised technique

Technique	Type
Intelligent Data Pruning	Under-sampling

Learners used to test sampling techniques

Logistic Regression
SVM
Decision Tree

Evaluation metrics sampling techniques

Accuracy of learner
Precision of learner
Recall of learner
F1 score of learner
Average wins of each sampling technique based on above 4 metrics

How to run code

The code (notebooks) used has been provided under /src. For experimentation, notebooks under src/experiments showcase how to load the data and run for an interation. For running experiments in the larger context notebooks under /src/automation specifiy the details of all the experiments runs and is also the main source of data collected. Files in /src/automation will generate a csv file for each sampling technique.

/src/analyse_scott_knott.ipynb - This notebook takes the results from the previous steps and plots the scott knott graphs. These graphs are stored under /results/**/sk/ respectively for all the datasets.

src/runtime_analysis.ipynb - This notebook mesures the time taken by each sampling technique for 100% sampling.

src/data_distribution.ipynb - This notebook determines the class distribution for all datasets.

Results

Scott-knott plots

To find scott-knott plots for a particular metric for a given learner, go to results/{dataset_name}/sk/ for respective dataset

Radar plot of results

As we can see, the area covered by Intelligent Data Pruning (orange line) is much higher than other techniques. Hence, we conclude that intelligent pruning is much more scalable and achieves competitive model performance in comparison to other sampling techniques.

Citations

X. Ling, T. Menzies, C. Hazard, J. Shu and J. Beel, "Trading Off Scalability, Privacy, and Performance in Data Synthesis," in IEEE Access, vol. 12, pp. 26642-26654, 2024, doi: 10.1109/ACCESS.2024.3366556. keywords: {Synthetic data;Clustering algorithms;Data models;Engines;Biomedical imaging;Generative adversarial networks;Data privacy;Regression analysis;Classification algorithms;Scalability;Homomorphic encryption;Synthetic data generation;privacy preservation;regression;classification},
https://github.com/yrahul3910/SyntheticOversampling {RRP oversampling}

Team members

Rishi Singhal
Deepak Rajendran
Aditya Pai Brahmavar

Name		Name	Last commit message	Last commit date
Latest commit History 185 Commits
data		data
results		results
src		src
.DS_Store		.DS_Store
A1.pdf		A1.pdf
A2.pdf		A2.pdf
Few Informative Data Samples are Good Enough Introducing Intelligent Data Pruning.pdf		Few Informative Data Samples are Good Enough Introducing Intelligent Data Pruning.pdf
README.md		README.md
data.zip		data.zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data decent

Abstract

Datasets

Data sampling techniques used/created

Baselines

Proposed + devised technique

Learners used to test sampling techniques

Evaluation metrics sampling techniques

How to run code

Results

Scott-knott plots

Radar plot of results

Citations

Team members

About

Releases

Packages

Contributors 3

Languages

adipai/data-decent

Folders and files

Latest commit

History

Repository files navigation

Data decent

Abstract

Datasets

Data sampling techniques used/created

Baselines

Proposed + devised technique

Learners used to test sampling techniques

Evaluation metrics sampling techniques

How to run code

Results

Scott-knott plots

Radar plot of results

Citations

Team members

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages