Skip to content

Latest commit

 

History

History
169 lines (115 loc) · 10.8 KB

README.md

File metadata and controls

169 lines (115 loc) · 10.8 KB

Best Partition Micro-F1

This code is part of my PhD research at PPG-CC/DC/UFSCar. The aim is select the best partition for multilabel classification using the Micro-F1 metric.

How to cite

@misc{Gatto2021, author = {Gatto, E. C.}, title = {Select Best Partition with Micro-F1}, year = {2021}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {\url{https://github.com/cissagatto/Best-Partition-MicroF1}}}

Source Code

This code source is composed of the project R to be used in RStudio IDE and also the following scripts R:

  1. libraries.R
  2. utils.R
  3. validation.R
  4. bestPartitions.R
  5. run.R
  6. microf1.R
  7. bpmi_config_files.R

Preparing your experiment

STEP 1

A file called datasets-original.csv must be in the root project folder. This file is used to read information about the datasets and they are used in the code. We have 90 multilabel datasets in this .csv file. If you want to use another dataset, please, add the following information about the dataset in the file:

Parameter Status Description
Id mandatory Integer number to identify the dataset
Name mandatory Dataset name (please follow the benchmark)
Domain optional Dataset domain
Instances mandatory Total number of dataset instances
Attributes mandatory Total number of dataset attributes
Labels mandatory Total number of labels in the label space
Inputs mandatory Total number of dataset input attributes
Cardinality optional
Density optional
Labelsets optional
Single optional
Max.freq optional
Mean.IR optional
Scumble optional
TCS optional
AttStart mandatory Column number where the attribute space begins*
AttEnd mandatory Column number where the attribute space ends
LabelStart mandatory Column number where the label space begins
LabelEnd mandatory Column number where the label space ends
Distinct optional
xn mandatory Value for Dimension X of the Kohonen map
yn mandatory Value for Dimension Y of the Kohonen map
gridn mandatory X times Y value. Kohonen's map must be square
max.neigbors mandatory The maximum number of neighbors is given by LABELS -1
  • Because it is the first column the number is always 1.

STEP 2

To run this experiment you need the X-Fold Cross-Validation files and they must be compacted in tar.gz format. You can download these files, with 10-folds, ready for multiple multilabel dataset by clicking here. For a new dataset, in addition to including it in the datasets-original.csv file, you must also run this code here. In the repository in question you will find all the instructions needed to generate the files in the format required for this experiment. The tar.gz file can be placed on any folder on your computer or cluster. The absolute path of the file should be passed as a parameter in the configuration file that will be read by microF1.R script. The dataset will be loaded from there.

STEP 3

You need to have installed all the R packages required to execute this code on your machine. Check out which are needed in the file libraries.R. This code does not provide any type of automatic package installation! You can use the Conda environment that I created to perform this experiment. Below are the links to download the files.

| download txt | download yml | download yaml |

Try to use the command below to extract the environment to your computer:

conda env create -file AmbienteTeste.yaml

See more information about Conda environments here

STEP 4

You will need the previously generated partitions by one of the following codes:

You must use the results generated from the OUTPUT folder in those source code. They must be compressed into a TAR.GZ file and placed in a folder on your computer. The absolute path of this folder must be passed as a parameter in the configuration file. Please see the example in the PARTITIONS folder in this source code.

STEP 5

Confirms if the folder Utils contains the following files: Clus.jar, R_csv_2_arff.jar, and weka.jar, and also the folder lib with commons-math-1.0.jar, jgap.jar, weka.jar and Clus.jar. Without these jars, the code not runs.

STEP 6

To run this code you will need a configuration file saved in csv format and with the following information:

Config Value
Dataset_Path Absolute path to the folder where the dataset's tar.gz is stored
Temporary_Path Absolute path to the folder where temporary processing will be performed*
Partitions_Path Absolute path to the folder where partitions are store
similarity Choose which one to run: jaccard, rogers, random1 and random2
dataset_name Dataset name according to datasets-original.csv file
number_dataset Dataset number according to datasets-original.csv file
number_folds Number of folds used in cross validation
number_cores Number of cores for parallel processing
  • Use folders like /dev/shm, tmp or scratch here.

Example:

You can save configuration files wherever you want. The absolute path will be passed as a command line argument.

Folder Strucutre

Software Requirements

This code was develop in RStudio Version 1.4.1106 © 2009-2021 RStudio, PBC "Tiger Daylily" (2389bc24, 2021-02-11) for Ubuntu Bionic Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.12.8 Chrome/69.0.3497.128 Safari/537.36. The R Language version was: R version 4.1.0 (2021-05-18) -- "Camp Pontanezen" Copyright (C) 2021 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit).

Hardware Requirements

This code may or may not be executed in parallel, however, it is highly recommended that you run it in parallel. The number of cores can be configured via the command line (number_cores). If number_cores = 1 the code will run sequentially. In our experiments, we used 10 cores. For reproducibility, we recommend that you also use ten cores. This code was tested with the birds dataset in the following machine:

System:

Host: bionote | Kernel: 5.8.0-53-generic | x86_64 bits: 64 | Desktop: Gnome 3.36.7 | Distro: Ubuntu 20.04.2 LTS (Focal Fossa)

CPU:

Topology: 6-Core | model: Intel Core i7-10750H | bits: 64 | type: MT MCP | L2 cache: 12.0 MiB | Speed: 800 MHz | min/max: 800/5000 MHz Core speeds (MHz): | 1: 800 | 2: 800 | 3: 800 | 4: 800 | 5: 800 | 6: 800 | 7: 800 | 8: 800 | 9: 800 | 10: 800 | 11: 800 | 12: 800 |

Then the experiment was executed in a cluster at UFSCar.

Important: we used the CLUS classifier in this experiment. This implies generating all physical ARFF training, validating, and testing files for each of the generated random partitions. Our code generates the partitions first in memory and then saves them to the HD. However, to avoid memory problems, immediately after saving to HD, the files are validated (or tested) and then deleted. Even so, make sure you have enough space on your HD and RAM for this procedure.

Results

The results stored in the folder OUTPUT it will be used in the next phase: Best-Partition-Silhoute, Test-Best-Partition-MacroF1, Test-Best-Partition-MicroF1 or Test-Best-Partition-Silhouette. The result for a dataset must be put in the folder PARTITIONS in the respective code.

RUN

To run the code, open the terminal, enter the ~/Best-Partition-MiF1-Clus/R folder, and type

Rscript microf1.R [absolute_path_to_config_file]

Example:

Rscript microf1.R "~/Best-Partition-MiF1-Clus/BPMI-Config-Files/jaccard/BPMI-GpositiveGO.csv"

RESULTS

The results are stored in a folder called reports in the project root.

DOWNLOAD RESULTS

[Click here]

Acknowledgment

  • This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
  • This study was financed in part by the Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brasil (CNPQ) - Process number 200371/2022-3.
  • The authors also thank the Brazilian research agencies FAPESP financial support.

Contact

elainececiliagatto@gmail.com

Links

| Site | Post-Graduate Program in Computer Science | Computer Department | Biomal | CNPQ | Ku Leuven | Embarcados | Read Prensa | Linkedin Company | Linkedin Profile | Instagram | Facebook | Twitter | Twitch | Youtube |

Thanks