ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset.
For a detailed description and experimental results, please refer to the ICLR 2020 paper *ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators*.
Compositional Embeddings Using Complementary Partitions is a relatively novel approach that reduces embedding size in an end-to-end fashion by exploiting complementary partitions of the category set to produce a unique embedding vector for each category without defining one explicitly. It is an effective, memory-efficient technique for models with massive vocabularies or high-cardinality features, which can otherwise become bottlenecks during training. The authors show that the information loss from the generated compositional embeddings is minimal compared to full embeddings, and that the quotient-remainder trick used in the paper is more effective than the earlier hashing trick.
For a detailed description and experimental results, please refer to the paper *Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems*.
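To make the idea concrete, here is a minimal NumPy sketch of the quotient-remainder trick (an illustration only, not the implementation used in this repository): each category index is split into a quotient and a remainder with respect to a divisor of roughly sqrt(|V|), each part indexes a small table, and the two partial embeddings are composed element-wise (one of the composition operations considered in the paper) into a vector that is unique per category.

```python
import numpy as np

def quotient_remainder_embedding(vocab_size, embed_dim, seed=0):
    """Minimal sketch of the quotient-remainder trick (not this repository's code).

    Instead of one |V| x d table, keep two tables of roughly sqrt(|V|) rows each
    and compose them, cutting memory from O(|V| * d) to O(sqrt(|V|) * d).
    """
    rng = np.random.default_rng(seed)
    divisor = int(np.ceil(np.sqrt(vocab_size)))            # size of the "remainder" partition
    quotient_table = rng.normal(size=(int(np.ceil(vocab_size / divisor)), embed_dim))
    remainder_table = rng.normal(size=(divisor, embed_dim))

    def lookup(token_ids):
        ids = np.asarray(token_ids)
        q = ids // divisor   # which block of `divisor` consecutive ids this id falls in
        r = ids % divisor    # position of the id within that block
        # Element-wise product composes the two partial embeddings; every id gets a
        # distinct (q, r) pair, hence a distinct composed vector.
        return quotient_table[q] * remainder_table[r]

    return lookup

# A vocabulary of 1M categories needs only two tables of ~1,000 rows each.
embed = quotient_remainder_embedding(vocab_size=1_000_000, embed_dim=128)
vectors = embed([3, 42, 999_999])   # vectors.shape == (3, 128)
```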
This repository contains code to pre-train ELECTRA with an option to use memory-efficient compositional embeddings for datasets that have huge vocabulary sizes. The repository currently only supports text data from CSV files on a single GPU. An example of fine-tuning ELECTRA on a sentiment classification task is provided. This code is ideal for researchers working with non-English datasets, recommendation engine data, MIDI files, etc.
Use `pretraining.py` to pre-train an ELECTRA model. It has the following arguments:

- `--raw_data_loc` (optional): location of the raw CSV file containing text sentences.
- `--col_name` (optional): name of the text column in the dataset to use for pretraining.
- `--working_dir` (optional): directory in which to store model weights, configs, and vocabulary tokens.
- `--hparams` (optional): a dict containing model hyperparameters. See `Pretraining_Config.py` under the `Configs` folder for the supported default hyperparameters. To override any of the defaults, pass them as a dictionary, for example: `--hparams {"hparam1": value1, "hparam2": value2, ...}`. An example invocation is sketched below.
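A hypothetical way to launch pretraining from Python (the file paths, working directory, and hyperparameter names below are placeholders; this assumes the script accepts the `--hparams` dict as a JSON string, and the real keys live in `Configs/Pretraining_Config.py`):

```python
import json
import subprocess

# Placeholder hyperparameter names -- use keys defined in Configs/Pretraining_Config.py.
hparams = {"hparam1": 128, "hparam2": 0.1}

subprocess.run(
    [
        "python", "pretraining.py",
        "--raw_data_loc", "train.csv",          # CSV file with a column of text
        "--col_name", "text",                   # name of that column
        "--working_dir", "./pretraining_output",
        "--hparams", json.dumps(hparams),       # e.g. '{"hparam1": 128, "hparam2": 0.1}'
    ],
    check=True,
)
```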
For example notebooks of pretraining the ELECTRA model, see `Pretraining.ipynb` and `Pretraining_Compositional_Embeddings.ipynb`.
Use `FineTuning.py` to fine-tune the pre-trained ELECTRA model on a sentiment classification task. It has the following arguments:

- `--raw_data_loc` (optional): location of the raw CSV file containing text sentences.
- `--working_dir` (optional): directory from which to load the pretrained model weights, configs, and vocabulary tokens.
- `--hparams` (optional): a dict containing model hyperparameters. See `Finetuning_Config.py` under the `Configs` folder for the supported default hyperparameters. To override any of the defaults, pass them as a dictionary, for example: `--hparams {"hparam1": value1, "hparam2": value2, ...}`. An example invocation is sketched below.
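A sketch mirroring the pretraining call above (placeholders again; `--working_dir` should point at the directory the pretraining step wrote its weights, configs, and vocabulary to):

```python
import json
import subprocess

subprocess.run(
    [
        "python", "FineTuning.py",
        "--raw_data_loc", "sentiment_train.csv",   # CSV with the sentiment data
        "--working_dir", "./pretraining_output",   # directory produced by pretraining.py
        "--hparams", json.dumps({"hparam1": 3}),   # placeholder key; see Configs/Finetuning_Config.py
    ],
    check=True,
)
```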
For example notebooks of fine-tuning the ELECTRA model, see `FineTuning.ipynb` and `FineTuning_Compositional_Embeddings.ipynb`.
Fork this repository and run the pretraining example (instructions above) on the data provided to familiarize yourself with the repository and its parameters. To use it on your own data, create a CSV file with a column of text, as in the sketch below. Feel free to change the code as needed to support your own requirements.
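For instance, a minimal standard-library sketch that writes such a file (the column name `text` and the file name are just illustrations; pass whatever column name you choose via `--col_name`):

```python
import csv

# Any iterable of strings works: non-English sentences, item-interaction sequences
# from a recommendation engine, tokenised MIDI event streams, etc.
sentences = [
    "first training sentence",
    "second training sentence",
    "third training sentence",
]

with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])                  # header row: the column name for --col_name
    writer.writerows([s] for s in sentences)   # one sentence per row
```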
For issues related to the repository, please raise a GitHub issue or contact me at keshavbhandari@gmail.com
Please star the repository if you find it useful! Thanks :)