Fine tuning sequence-to-sequence models (BART, T5)

This repository contains a training script adapted from huggingface/transformers/tree/main/examples/pytorch/summarization.

Data

You can find the Yelp reviews dataset as a JSON file here. The yelp_academic_dataset_review.json file is approximately 5 GB.
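
The review file is distributed in JSON Lines format (one JSON object per line). Since it weighs in at roughly 5 GB, reading it in chunks with pandas keeps memory usage manageable; a minimal sketch (the chunk size is an arbitrary choice):

import pandas as pd

# the review file is JSON Lines: one JSON object per line
reader = pd.read_json("yelp_academic_dataset_review.json", lines=True, chunksize=100_000)

# take only the first chunk here instead of loading the full ~5 GB into memory
reviews = next(iter(reader))
print(reviews[["stars", "useful", "funny", "cool", "text"]].head())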

Setup

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Note

A GPU is recommended for training. If you don't have one, you can use a free GPU from Google Colab.
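
To quickly check whether PyTorch actually sees a GPU (for instance after enabling one in Colab), you can run:

import torch

# True if a CUDA GPU is visible to PyTorch
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))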

Fine tuning

The goal is to generate synthetic columns containing natural language with seq2seq models. Let's take the following data (from Yelp) as an example, where the text column is the one we want to generate (conditionally on the values of the other columns):

| review id | stars | useful | funny | cool | text |
|-----------|-------|--------|-------|------|------|
| 0 | 5 | 1 | 0 | 1 | "Wow! Yummy, different, delicious. Our favorite is the lamb curry and korma. With 10 different kinds of naan!!! Don't let the outside deter you (because we almost changed our minds)...go in and try something new! You'll be glad you did!" |

For training, you need a CSV file with two columns: input and target. The input column contains the text to condition the generation on, and the target column contains the text to generate. In our case, the input column holds the values of the other columns, and the target column holds the text column.

| input | target |
|-------|--------|
| "Generate review: stars: 5, useful: 1, funny: 0, cool: 1" | "Wow! Yummy, different, delicious. Our favorite is the lamb curry and korma. [...]" |

Once you have your train.csv and valid.csv files containing these two string columns, simply launch the training script with the following command:

python train_seq2seq.py config.json

The config.json file contains the following:

{
    "model_name_or_path": "facebook/bart",
    "train_file": "train.csv",
    "valid_file": "valid.csv",
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "do_train": true,
    "do_eval": true,
    "push_to_hub": false,
    "output_dir": "bart-finetuning",
    "overwrite_output_dir": true,
    "num_train_epochs": 2,
    "evaluation_strategy": "steps",
    "eval_steps": 10,
    "save_strategy": "epoch",
    "warmup_steps": 200,
    "gradient_checkpointing": true,
    "learning_rate": 1e-5,
    "fp16": false
}

Feel free to modify it as you like. At the end of each epoch, a model checkpoint should be saved in the directory specified by output_dir.
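
Assuming the script uses the standard Hugging Face Trainer (as in the example it is adapted from), each checkpoint is written to a checkpoint-<step> subdirectory of output_dir. A small sketch for locating the most recent one:

import os

output_dir = "bart-finetuning"
checkpoints = sorted(
    (d for d in os.listdir(output_dir) if d.startswith("checkpoint-")),
    key=lambda d: int(d.split("-")[-1]),
)
latest_checkpoint = os.path.join(output_dir, checkpoints[-1])
print(latest_checkpoint)  # e.g. bart-finetuning/checkpoint-<step>

This path can then be used as model_path in the inference snippet below.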

Inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = 'path_to_your_checkpoint'

model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)


text = "Generate review: stars: 5, useful: 1, funny: 0, cool: 1"
inputs = tokenizer(text, return_tensors='pt')

# you can play with the parameters and the sampling/decoding strategy here
out = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=10,
    # do_sample=True,
    # temperature = 1.2,
    # top_p=0.8,
    num_return_sequences=5,
    length_penalty=5
)

# decode the generated token ids back into text
gen_texts = []
for gen in out:
    gen_texts.append(tokenizer.decode(gen, skip_special_tokens=True))

for q in gen_texts:
    print(q)
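
Generation is noticeably faster on a GPU. A minimal sketch that reuses the model, tokenizer, and text from above, assuming a CUDA device is available:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# move the tokenized inputs to the same device as the model
inputs = tokenizer(text, return_tensors="pt").to(device)

out = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=10,
    num_return_sequences=5,
    length_penalty=5,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))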
