Clustering-BD

This repository contains our project for the course Mining from Massive Datasets, part of the MSc in Data and Web Science at AUTH.

Installation of requirements

Python 3.7+

pip install -r requirements.txt

Usage

Run CURE.py (tasks 2 and 3)

You should select these parameters:

-d dataset path
-th threshold of representative points
-k K value for hierarchical clustering

spark-submit --master local[*] --driver-memory 8g CURE.py -d Datasets/Data1.csv -k 8 -th 6
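As an illustration of how the three flags above map to values, they could be parsed with argparse roughly as follows. This is a minimal sketch, not the code in CURE.py: the helper name `parse_cure_args`, the long option names, and the integer types for -th and -k are assumptions.

```python
import argparse


def parse_cure_args(argv=None):
    """Parse the -d, -th and -k flags (hypothetical helper, not from the repo)."""
    parser = argparse.ArgumentParser(description="CURE clustering (sketch)")
    parser.add_argument("-d", "--dataset", required=True, help="dataset path")
    # Assumption: the threshold and K are integers, as in the example command.
    parser.add_argument("-th", "--threshold", type=int, required=True,
                        help="threshold of representative points")
    parser.add_argument("-k", type=int, required=True,
                        help="K value for hierarchical clustering")
    return parser.parse_args(argv)


# Same arguments as the spark-submit example above.
args = parse_cure_args(["-d", "Datasets/Data1.csv", "-k", "8", "-th", "6"])
print(args.dataset, args.k, args.threshold)
```

Passing an explicit list keeps the sketch easy to try outside Spark; calling `parse_cure_args()` with no argument would read the real command line instead.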

Run kmeans_hierarchical.py (task 4)

You should select these parameters:

-d dataset path
-s start value of k for fine tuning
-e end value of k for fine tuning

spark-submit --master local[*] --driver-memory 10g kmeans_hierarchical.py -d Datasets/Data1.csv -s 2 -e 14

Run clustering.py

You should select these parameters:

-d dataset path
-a name of the clustering algorithm {kmeans|bkmeans}; bkmeans (bisecting k-means) is a hierarchical clustering algorithm
-s start value of k for fine tuning
-e end value of k for fine tuning

spark-submit --master local[*] --driver-memory 8g clustering.py -d Datasets/Data1.csv -a kmeans -s 2 -e 14
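The -s and -e flags describe a sweep over candidate values of k. One way such a sweep could be organised is sketched below. It is not taken from clustering.py: the function names, the inclusive end value, and the choice of the silhouette score as the selection criterion are all assumptions. The Spark-backed scorer uses `KMeans`, `BisectingKMeans`, and `ClusteringEvaluator` from `pyspark.ml` and expects a DataFrame with a vector column named "features".

```python
def sweep_k(start, end, score_for_k):
    """Score every k in [start, end] and return (best_k, scores).

    Assumption: `end` is inclusive and a higher score is better.
    """
    if start < 2 or end < start:
        raise ValueError("expected 2 <= start <= end")
    scores = {k: score_for_k(k) for k in range(start, end + 1)}
    best_k = max(scores, key=scores.get)
    return best_k, scores


def spark_silhouette_scorer(features_df, algorithm="kmeans", seed=1):
    """Build a score_for_k callable backed by Spark ML (requires pyspark).

    Hypothetical helper: `features_df` must have a vector column "features".
    """
    # Imported lazily so that sweep_k stays usable without Spark installed.
    from pyspark.ml.clustering import BisectingKMeans, KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator

    estimator_cls = {"kmeans": KMeans, "bkmeans": BisectingKMeans}[algorithm]
    evaluator = ClusteringEvaluator()  # silhouette is the default metric

    def score_for_k(k):
        model = estimator_cls(k=k, seed=seed).fit(features_df)
        return evaluator.evaluate(model.transform(features_df))

    return score_for_k


# Toy scores (made up for illustration) to show the selection logic only.
toy_scores = {2: 0.41, 3: 0.63, 4: 0.58}
best_k, scores = sweep_k(2, 4, toy_scores.get)
print(best_k)  # 3
```

With a real dataset, `sweep_k(2, 14, spark_silhouette_scorer(df, "bkmeans"))` would mirror the -a, -s, and -e values of the command above, under the same assumptions.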

Run insert_outliers.py (task 5)

You should select these parameters:

-d dataset path
-s saving path
-p percentage for uniform sampling
-n number of desired duplicate outliers

spark-submit --master local[*] --driver-memory 8g insert_outliers.py -d Datasets/Data1.csv -s Datasets/Data1_with_outliers -p 0.025 -n 20

Run find_outliers_kde.py (task 5)

You should select these parameters:

-d dataset path
-th threshold of representative points
-k K value for hierarchical clustering

spark-submit --master local[*] --driver-memory 8g find_outliers_kde.py -d Datasets/Data1_with_outliers -th 8 -k 6

Run find_outliers_cure_based.py (task 5)

You should select these parameters:

-d dataset path
-th threshold of representative points
-k K value for hierarchical clustering

spark-submit --master local[*] --driver-memory 8g find_outliers_cure_based.py -d Datasets/Data1_with_outliers -th 8 -k 6
