In this work, we explore the capabilities of the novel Transformer architecture T5 (Text-To-Text Transfer Transformer) to support code-related tasks.
For all the details 👉 📄
In order to pre-train and then fine-tune a T5 small model, we need a new SentencePiece model to accommodate the expanded vocabulary given by the Java programming language, abstracted Java tokens, and technical natural language.
-
How to train a new SentencePiece model
Pythonic way
```
pip install sentencepiece
```

```python
import sentencepiece as spm

# Train the tokenizer on the pre-training corpus (Java code, abstracted Java tokens, technical English)
spm.SentencePieceTrainer.train('--input=pretraining.txt --model_prefix=dl4se --vocab_size=32000 --bos_id=-1 --eos_id=1 --unk_id=2 --pad_id=0')
```
The new model has to be trained on the entire pre-training corpus.
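Once training completes, the resulting dl4se.model can be loaded to tokenize Java code and technical English. A minimal sketch (the snippet passed to `encode` is just an illustrative abstracted-Java example):

```python
import sentencepiece as spm

# Load the SentencePiece model produced by the training step above
sp = spm.SentencePieceProcessor(model_file='dl4se.model')

# Tokenize an illustrative abstracted Java snippet into subword pieces
pieces = sp.encode('public METHOD_1 ( ) { return VAR_1 ; }', out_type=str)
print(pieces)
```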
-
To set up a new GCS bucket for training and fine-tuning a T5 model, please follow the original guide provided by Google. Here is the link: https://cloud.google.com/storage/docs/quickstart-console Subsequently, by following the Jupyter notebook we provide for pre-training and fine-tuning the network, you should be able to set up the final environment.
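If you prefer the command line, a bucket can also be created and populated with gsutil; the bucket name, region, and paths below are placeholders, not the ones used in our experiments:

```
# Create a new bucket in a region close to your TPU/VM
gsutil mb -l us-central1 gs://your-dl4se-bucket

# Upload the SentencePiece model and the datasets to the bucket
gsutil cp dl4se.model dl4se.vocab gs://your-dl4se-bucket/
gsutil -m cp -r datasets/ gs://your-dl4se-bucket/
```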
-
The datasets for the pre-training and the fine-tuning can be found here: https://drive.google.com/drive/folders/1uJv-kljY1Q59fa-TdkpXOOd9QEG5OZDa?usp=sharing
-
To pre-train and then fine-tune T5, please use the script we provide here:
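As a rough sketch of what the script does with the t5 library on TPU (model size, paths, step counts, and task names below are illustrative placeholders, and the tasks must first be registered through t5.data.TaskRegistry):

```python
import t5

# Placeholder TPU address and bucket paths; the notebook/script sets the real ones
TPU_ADDRESS = 'grpc://10.0.0.1:8470'
MODEL_DIR = 'gs://your-dl4se-bucket/models/small'

model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=TPU_ADDRESS,
    tpu_topology='v2-8',
    model_parallelism=1,
    batch_size=128,
    sequence_length={'inputs': 512, 'targets': 512},
    learning_rate_schedule=0.001,
    save_checkpoints_steps=5000,
)

# Pre-train with the denoising objective, then fine-tune on a downstream code task
# ('pretraining_task' and 'downstream_task' are hypothetical registered task names)
model.train(mixture_or_task_name='pretraining_task', steps=200000)
model.finetune(mixture_or_task_name='downstream_task',
               pretrained_model_dir=MODEL_DIR,
               finetune_steps=100000)
```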
-
First, you need to convert the TF model into a PyTorch model by using TF_to_Pytorch, then run Generate Results.
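A rough sketch of the conversion and generation steps with HuggingFace Transformers (the checkpoint, config, and input values are placeholders; refer to TF_to_Pytorch and Generate Results for the exact procedure):

```python
import torch
import sentencepiece as spm
from transformers import T5Config, T5ForConditionalGeneration

# Convert the fine-tuned TF checkpoint into a PyTorch model (placeholder paths)
config = T5Config.from_json_file('config.json')
model = T5ForConditionalGeneration.from_pretrained(
    'model.ckpt-100000.index',  # TF checkpoint index file exported by the t5 library
    from_tf=True,
    config=config,
)
model.save_pretrained('pytorch_model/')

# Generate a prediction with the converted model and the dl4se SentencePiece tokenizer
sp = spm.SentencePieceProcessor(model_file='dl4se.model')
input_ids = torch.tensor([sp.encode('public METHOD_1 ( ) { return VAR_1 ; }')])
output_ids = model.generate(input_ids, max_length=128)
print(sp.decode(output_ids[0].tolist()))
```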
Additional: In the Miscellaneous folder, you can find all the additional scripts we used for computing the BLEU score and the overlap metrics. Furthermore, here and here you can also experiment with our pre-trained and fine-tuned models.
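For instance, a corpus-level BLEU score between model predictions and references can be computed along these lines (file names are placeholders; the scripts in Miscellaneous may use a different BLEU variant):

```python
import sacrebleu

# Placeholder files: one prediction/reference per line, aligned by position
with open('predictions.txt') as f:
    predictions = [line.strip() for line in f]
with open('references.txt') as f:
    references = [line.strip() for line in f]

# sacrebleu takes the hypotheses plus a list of reference streams
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f'BLEU: {bleu.score:.2f}')
```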