This is a package to fine-tune language models in order to create clustering-friendly embeddings. It is based on the paper Supervised clustering loss for clustering-friendly sentence embeddings: An application to intent clustering, (2023) G Barnabo, A Uva, S Pollastrini, C Rubagotti, D Bernardi
@Inproceedings{Barnabo2023,
author = {Giorgio Barnabo and Antonio Uva and Sandro Pollastrini and Chiara Rubagotti and Davide Bernardi},
title = {Supervised clustering loss for clustering-friendly sentence embeddings: An application to intent clustering},
year = {2023},
url = {https://www.amazon.science/publications/supervised-clustering-loss-for-clustering-friendly-sentence-embeddings-an-application-to-intent-clustering},
booktitle = {IJCNLP-AACL 2023},
}
p-lightning-template
| conf # contains Hydra config files
| data
| model
| train
root.yaml # hydra root config file
| data # datasets should go here
| experiments # where the models are stored
| src
| pl_data_modules.py # base LightinigDataModule
| pl_modules.py # base LightningModule
| train.py # main script for training the network
| README.md
| requirements.txt
| setup.sh # environment setup script
The structure of the repository is very simplistic and involves mainly four components:
- pl_data_modules.py where you can declare your LightningDataModules.
- pl_modules.py where you can declare your LightningModules.
- train.py the main script to start the training phase.
- conf the directory containing the config files of Hydra.
In order to set up the python interpreter we utilize conda
, the script setup.sh
creates a conda environment and install pytorch
and the dependencies in "requirements.txt".
To use this repository as a starting template for your projects, you can just click the green button "Use this template" at the top of this page. More on using GitHub repositories on the following link.
Q: When I run any script using a Hydra config I can see that relative paths do not work. Why?
A: Whenever you run a script that uses a Hydra config, Hydra will create a new working directory (specified in the root.yaml file) for you. Every relative path you will use will start from it, and this is why you get the 'FileNotFound' error. However, using a different working directory for each of your experiments has a couple of benefits that you can read in the Hydra documentation for the Working directory. There are several ways that hydra offers as a workaround for this problem here we will report the two that the authors of this repository use the most, but you can find the other on the link we previously mentioned:
-
You can use the 'hydra.utils.to_absolute_path' function that will convert every relative path starting from your working directory (p-lightning-template in this project) to a full path that will be accessible from inside the new working dir.
-
Hydra will provide you with a reference to the original working directory in your config files. You can access it under the name of 'hydra:runtime.cwd'. So, for example, if your training dataset has the relative path 'data/train.tsv' you can convert it to a full path by prepending the hydra variable before
Contributions are always more than welcome, the only thing to take into account when submitting a pull request is that we utilize the Black code formatter with a max length for the code of 120. More pragmatically you should ensure to utilize the command "black -l 120" on the whole "src" directory before pushing the code.
This repository has been created with the idea of providing a simple skeleton from which you can start a PyTorch Lightning project. Instead of favoring the customizability, we favored the simplicity and we intended this template as a base for building more specific templates based on the user needs (for example by forking this one). However, there are several other repositories with different features that you can check if interested. We will list two of them here:
- lucmos/nn-template: a very nice template with support for DVC.
- hobogalaxy/lightning-hydra-template: another useful and very well documented template repository.
You need to have conda installed. Please refer to the conda installation page. The miniconda version is sufficient.
After conda installation, run the following script
sh setup.sh
Download a sentence-transformer model by selecting is name here: HuggingFace Sentence-Transformer.
E.g., suppose that you selected paraphrase-multilingual-MiniLM-L12-v2
, the run this code.
Run:
python download_base_model.py <name-of-the-model>
It will download it in the folder base_language_models.
when we tested our model we used the following four base sentence encoders base_language_model_folder:
bert-base-multilingual-cased
xlm-roberta-base
sentence-transformers_all-mpnet-base-v2
paraphrase-multilingual-mpnet-base-v2
We suggest to use paraphrase-multilingual-mpnet-base-v2
, which gives good performances even without fine-tuning.
To fine-tune any of the 4 base sentence encoders you should follow these steps.
- given some utterances that you want to use for fine-tuning, create train, dev and test sets such that each intent is represented in only one of the three splits; utterances can come from different domains but in that case some intents from each domain should be present in each split;
- for example, let’s assume that you have utterances with the following intents: "CreateAccount", "ChangeAddress", "ResetPassword", "RemoveDependent", "ReportAutomobileAccident", "AddDependent", "ReportBillingIssue", "UpdatePaymentPreference", "CancelPlan", "RequestProofOfInsurance"
- you could first randomly assign 33% of these intents to the train set, 33% to the dev set and 33% to the test set and then place all the utterances with these intents in the corresponding dataset split;
- upload your dataset splits in the data folder
supervised-intent-clustering/data/New_Fine_Tuning_Dataset
. Each file (train, dev, test) should come in the form of a csv file with the following columns:dataset,utterance_id,utterance_split,utterance_lang,utterance_text,utterance_intent
- the folder already contains three sample splits that can be open to better understand how to create the necessary datasets
- modify the dataset-specific hyper-parameters in the corresponding file
supervised-intent-clustering/conf/dataset_specific_hyperparams/new_fine_tuning_dataset.yaml
- language_list: a list representing a subset of language identifiers that you used in the creation of your datasets; if your datasets contain many languages you can perform training only on a subset of those;
- modality: a string that you can add to the named of the fine-tuned model; you can leave it blank
- intent_classes_per_batch: a number between 2 and min(#train_intents, #dev_intents, #test_intents); the greater the better;
- samples_per_class_per_batch: how many utterances for each intents to add in each batch; ideally the total number of utterances per batch (intent_classes_per_batch*samples_per_class_per_batch) should be around 150; if not enough utterances for an intent are found the available ones are randomly sampled with repetition;
- remark:
intent_classes_per_batch*samples_per_class_per_batch
should not be greater than the number of examples in train, dev or test, otherwise the errorIndexError: list index out of range
will be raised.
- remark:
- limit_train_batches: number of batches per training epoch - we suggest a small number between 4 and 10;
- limit_val_batches: number of batches used to collect validation metrics - we suggest a small number between 4 and 10;
- finally, to launch the fine-tuning you just need to run the following command:
PYTHONPATH=. python3 ./src/train.py dataset_specific_hyperparams=new_fine_tuning_dataset model.base_model_name=<the_base_model_you_want_to_fine_tune> model.training_objective=<the_training_objective_you_want_to_use``>
- model.base_model_name: when we tested our model we used the following four base sentence encoders base_language_model_folder:
bert-base-multilingual-cased
xlm-roberta-base
sentence-transformers_all-mpnet-base-v2
paraphrase-multilingual-mpnet-base-v2
- We suggest using
sentence-transformers_all-mpnet-base-v2
for English-only datasets andparaphrase-multilingual-mpnet-base-v2
for the multilingual ones - model.training_objective: you can choose among the following losses:
- contrastive_learning
- cosine_similarity_loss
- triplet_margin_loss
- supervised_learning
- binary_classification
- We suggest using either the
supervised_learning
loss or thetriplet_margin_loss
. Default hyper-parameters should work just fine.
- model.base_model_name: when we tested our model we used the following four base sentence encoders base_language_model_folder:
- when the training finishes due to early stopping, you will find your fine-tuned model in the corresponding folder
supervised-intent-clustering/fine_tuned_language_models
. The produce sentence encoder can be directly used with the Hugging Face Sentence Encoder Library:-
from sentence_transformers import SentenceTransformer
def get_sentence_embeddings(list_of_sentences: List[str], hf_model_name_or_path: str=<path_to_the_folder_where_the_model_was_saved>):
model = SentenceTransformer(hf_model_name_or_path) sentence_embeddings = model.encode( list_of_sentences, batch_size=64, show_progress_bar=True) return sentence_embeddings
-
- For each experiment you run, key performance metrics will be logged in the file
supervised-intent-clustering/experiment_metrics.txt.
In particular, you want the PRAUC after training to be higher then the PRAUC before training on both the train, dev and test sets. If you run multiple experiments, you can turn theexperiment_metrics.txt
file into a more readable .csv file running the following command:PYTHONPATH=. python3 ./src/
training_log_post_processing.py
- To further customise the training behaviour, you can manually modify the following files:
supervised-intent-clustering/conf/model/default_model.yaml
supervised-intent-clustering/conf/train/default_train.yaml
supervised-intent-clustering/conf/data/default_data.yaml
supervised-intent-clustering/src/train.py