In this work, we explore the capabilities of the novel Transformer architecture T5 (Text-To-Text Transfer Transformer) to support code-related tasks.
For all the details 👉 📄
In order to pre-train and then fine-tune a T5 small model, we need a new SentencePiece model to accommodate the expanded vocabulary given by the Java programming language, abstracted Java tokens, and technical natural language.
-
How to train a new SentencePiece model
Pythonic way
```
pip install sentencepiece
```

```python
import sentencepiece as spm

# Train the tokenizer on the pre-training corpus (Java code, abstracted Java tokens, technical English)
spm.SentencePieceTrainer.train('--input=pretraining.txt --model_prefix=dl4se --vocab_size=32000 --bos_id=-1 --eos_id=1 --unk_id=2 --pad_id=0')
```
The new model has to be trained on the entire pre-training corpus.
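Once training completes, the resulting dl4se.model can be loaded to tokenize Java code and technical English. A minimal sketch (the snippet passed to `encode` is just an illustrative abstracted-Java example):

```python
import sentencepiece as spm

# Load the SentencePiece model produced by the training step above
sp = spm.SentencePieceProcessor(model_file='dl4se.model')

# Tokenize an illustrative abstracted Java snippet into subword pieces
pieces = sp.encode('public METHOD_1 ( ) { return VAR_1 ; }', out_type=str)
print(pieces)
```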
-
To set up a new GCS bucket for training and fine-tuning a T5 model, please follow the original guide provided by Google. Here is the link: https://cloud.google.com/storage/docs/quickstart-console Subsequently, by following the Jupyter notebook we provide for pre-training and fine-tuning the network, you should be able to set up the final environment.
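If you prefer the command line, a bucket can also be created and populated with gsutil; the bucket name, region, and paths below are placeholders, not the ones used in our experiments:

```
# Create a new bucket in a region close to your TPU/VM
gsutil mb -l us-central1 gs://your-dl4se-bucket

# Upload the SentencePiece model and the datasets to the bucket
gsutil cp dl4se.model dl4se.vocab gs://your-dl4se-bucket/
gsutil -m cp -r datasets/ gs://your-dl4se-bucket/
```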
-
The datasets for the pre-training and the fine-tuning can be found here: https://drive.google.com/drive/folders/1uJv-kljY1Q59fa-TdkpXOOd9QEG5OZDa?usp=sharing
-
To pre-train and then fine-tune T5, please use the script we provide here:
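As a rough sketch of what the script does with the t5 library on TPU (model size, paths, step counts, and task names below are illustrative placeholders, and the tasks must first be registered through t5.data.TaskRegistry):

```python
import t5

# Placeholder TPU address and bucket paths; the notebook/script sets the real ones
TPU_ADDRESS = 'grpc://10.0.0.1:8470'
MODEL_DIR = 'gs://your-dl4se-bucket/models/small'

model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=TPU_ADDRESS,
    tpu_topology='v2-8',
    model_parallelism=1,
    batch_size=128,
    sequence_length={'inputs': 512, 'targets': 512},
    learning_rate_schedule=0.001,
    save_checkpoints_steps=5000,
)

# Pre-train with the denoising objective, then fine-tune on a downstream code task
# ('pretraining_task' and 'downstream_task' are hypothetical registered task names)
model.train(mixture_or_task_name='pretraining_task', steps=200000)
model.finetune(mixture_or_task_name='downstream_task',
               pretrained_model_dir=MODEL_DIR,
               finetune_steps=100000)
```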
-
First, you need to convert the TF model into a PyTorch model by using TF_to_Pytorch, then run Generate Results.
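A rough sketch of the conversion and generation steps with HuggingFace Transformers (the checkpoint, config, and input values are placeholders; refer to TF_to_Pytorch and Generate Results for the exact procedure):

```python
import torch
import sentencepiece as spm
from transformers import T5Config, T5ForConditionalGeneration

# Convert the fine-tuned TF checkpoint into a PyTorch model (placeholder paths)
config = T5Config.from_json_file('config.json')
model = T5ForConditionalGeneration.from_pretrained(
    'model.ckpt-100000.index',  # TF checkpoint index file exported by the t5 library
    from_tf=True,
    config=config,
)
model.save_pretrained('pytorch_model/')

# Generate a prediction with the converted model and the dl4se SentencePiece tokenizer
sp = spm.SentencePieceProcessor(model_file='dl4se.model')
input_ids = torch.tensor([sp.encode('public METHOD_1 ( ) { return VAR_1 ; }')])
output_ids = model.generate(input_ids, max_length=128)
print(sp.decode(output_ids[0].tolist()))
```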
Additional: In the Miscellaneous folder, you can find all the additional scripts we used for computing the BLEU score and the overlap metrics. Furthermore, here and here you can also experiment with our pre-trained and fine-tuned models.
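For instance, a corpus-level BLEU score between model predictions and references can be computed along these lines (file names are placeholders; the scripts in Miscellaneous may use a different BLEU variant):

```python
import sacrebleu

# Placeholder files: one prediction/reference per line, aligned by position
with open('predictions.txt') as f:
    predictions = [line.strip() for line in f]
with open('references.txt') as f:
    references = [line.strip() for line in f]

# sacrebleu takes the hypotheses plus a list of reference streams
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f'BLEU: {bleu.score:.2f}')
```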