Model trainer for the classification service

This repository can be used to train supervised classifiers on new taxonomies.


Setup

Clone this repository and install the required libraries as follows:

git clone git@github.com:IntelCompH2020/taxonomical-classification.git
cd taxonomical-classification
bash setup_environment.sh
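
Optionally, you can verify that the installation succeeded. Assuming setup_environment.sh installs the HuggingFace transformers library that finetune_classifier.py builds on (an assumption on our part, not a documented guarantee), the library should now be importable:

python -c "import transformers; print(transformers.__version__)"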

How to use

1) Create a working directory for the new taxonomy.

The following command will create a new directory inside ./taxonomies with the given TAXONOMY_NAME.

TAXONOMY_NAME=<ADD_TAXONOMY_NAME_HERE>
bash new_taxonomy.sh taxonomy_name=$TAXONOMY_NAME

The new directory will have the following folder structure:

  • output/: Directory where the model checkpoints will be stored.
  • logs/: Directory where the .err and .out slurm log files will be stored.
  • tb/: Directory where tensorboard files will be stored, for visualization of the training progress.
  • hyperparameters.config.sh: Configuration file where the user can modify the default values of the hyperparameters.
  • generate_run.sh: Script to generate the run.sh file based on hyperparameters.config.sh.
  • finetune_classifier.py: Main code that runs on top of the Trainer class from HuggingFace.
  • data_loader.py: Script to load any parquet table for model training or inference.
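
For illustration, assuming TAXONOMY_NAME=my_taxonomy, the resulting layout would look like this:

taxonomies/my_taxonomy/
├── output/
├── logs/
├── tb/
├── hyperparameters.config.sh
├── generate_run.sh
├── finetune_classifier.py
└── data_loader.py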

2) Download the model to be finetuned.

Once the working directory has been created, the base model to be finetuned must be placed in the ./models directory.

This can be done by copying a checkpoint from your local file system, or by downloading one from the internet. We provide a few bash scripts that download publicly available models from the HuggingFace Hub.

As an example, the following command downloads a RoBERTa-large model into ./models/roberta-large:

bash models/download_roberta_large.sh
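
If there is no ready-made download script for the model you need, any public checkpoint can also be fetched from the HuggingFace Hub directly. A minimal sketch, assuming git-lfs is installed and using roberta-large as the example:

git lfs install
git clone https://huggingface.co/roberta-large models/roberta-large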

3) Go to your working directory.

cd taxonomies/$TAXONOMY_NAME

4) Edit the configuration file.

vim hyperparameters.config.sh
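
The exact variables are defined in the file itself and may differ between versions of this repository; the following is a purely illustrative sketch of the kind of values it exposes, with all names hypothetical:

# Hypothetical example values -- the real variable names are the ones
# already present in hyperparameters.config.sh.
MODEL=../../models/roberta-large   # base checkpoint downloaded in step 2
LEARNING_RATE=2e-5
BATCH_SIZE=16
EPOCHS=3
MODE=local                         # local or hpc
NUM_NODES=1                        # only relevant for hpc runs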

5) Generate the run.sh file.

bash generate_run.sh
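
Before launching, you can inspect the generated script to confirm it reflects your configuration:

cat run.sh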

6) Launch the script.

To run locally, follow this example:

TRAIN_DATA=../../data/toy_example/patstat_train/
DEV_DATA=../../data/toy_example/patstat_dev/
TEST_DATA=../../data/toy_example/patstat_test/
TEXT_COL="text"
LABEL_COL="ipc0"

bash run.sh train_files=$TRAIN_DATA dev_files=$DEV_DATA test_files=$TEST_DATA text_column=$TEXT_COL label_column=$LABEL_COL

To run on HPC, follow this example:

TRAIN_DATA=../../data/toy_example/patstat_train/
DEV_DATA=../../data/toy_example/patstat_dev/
TEST_DATA=../../data/toy_example/patstat_test/
TEXT_COL="text"
LABEL_COL="ipc0"

sbatch launcher.sh train_files=$TRAIN_DATA dev_files=$DEV_DATA test_files=$TEST_DATA text_column=$TEXT_COL label_column=$LABEL_COL
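
Once submitted, the job can be tracked with the usual slurm tooling (squeue is shown here as a general slurm example, not a script from this repository), and the .err and .out log files will be written to the logs/ directory created in step 1:

squeue -u $USER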

Note that the generated run.sh script will differ depending on the parameters set in the configuration file (e.g. whether it should run locally or on HPC, the number of nodes, etc.).

In both cases you should adapt the paths and column names to your own dataset; the commands above are just an example using the toy dataset provided in ./data.
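
In either case, training progress can be visualized by pointing TensorBoard at the tb/ directory of your taxonomy (assuming TensorBoard is installed in the environment):

tensorboard --logdir tb/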


Final note

The ./scripts directory contains the main code used to train classifiers, but there is no need to edit those files, since every configurable parameter is passed as an argument through the bash scripts in the working directory. The files in ./utils should not be modified either.

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101004870. H2020-SC6-GOVERNANCE-2018-2019-2020 / H2020-SC6-GOVERNANCE-2020
