NLP-BERT-for-Relation-Extraction

Fourth assignment in the 'NLP - Natural Language Processing' course by Prof. Yoav Goldberg, Prof. Ido Dagan and Prof. Reut Tsarfaty at Bar-Ilan University.

In this assignment, I implemented a machine-learning-based Relation Extraction (RE) system using transfer learning: I took a pre-trained BERT language model and fine-tuned it for a few epochs on the given Relation Extraction dataset.

I chose to examine and compare two main fine-tuning techniques for the BERT-Base-Uncased model on the relation classification task, using the provided dataset:

  1. The first technique is to train the entire architecture: the whole pre-trained model is trained further on the dataset together with the additional task-specific layers, and the final output is fed to a sigmoid layer. The error is backpropagated through the entire architecture, so the pre-trained weights are updated based on the new dataset.
  2. The second technique is to train some layers while freezing others: for example, to keep the weights of the model's initial layers frozen and retrain only the higher layers.
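The difference between the two techniques comes down to which parameters have `requires_grad` enabled. The following is a minimal PyTorch sketch of the second technique, not the code used in this assignment. It uses a toy stack of linear layers as a stand-in for the encoder so that it runs without downloading a model. The helper name `freeze_lower_layers`, the toy sizes, and the choice of 8 frozen layers are all illustrative assumptions. With a Hugging Face `BertModel`, the corresponding parts would be `model.embeddings` and `model.encoder.layer`.

```python
import torch
from torch import nn


def freeze_lower_layers(embeddings: nn.Module, layers: nn.ModuleList, n_frozen: int) -> None:
    """Freeze the embeddings and the first `n_frozen` encoder layers in place."""
    for param in embeddings.parameters():
        param.requires_grad = False
    for layer in layers[:n_frozen]:
        for param in layer.parameters():
            param.requires_grad = False


# Toy stand-in for a 12-layer encoder (hypothetical sizes, not BERT's real ones).
embeddings = nn.Embedding(100, 16)
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(12)])
head = nn.Linear(16, 1)  # task-specific layer, always trained

# Technique 2: freeze the embeddings and the 8 lowest layers (8 is an arbitrary example).
# Technique 1 would simply skip this call and train everything.
freeze_lower_layers(embeddings, layers, n_frozen=8)

# Only parameters that still require gradients are handed to the optimizer.
trainable = [
    param
    for module in (embeddings, layers, head)
    for param in module.parameters()
    if param.requires_grad
]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```

Varying `n_frozen` is one simple way to generate a family of partial fine-tuning settings to compare against full fine-tuning.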

To explore this thoroughly, I tried about 8 different settings of the second fine-tuning technique, in addition to experimenting with the first, in order to find the best fine-tuning settings of BERT for this task.

General approach:
This work was inspired by the paper ‘Enriching Pre-trained Language Model with Entity Information for Relation Classification’ by Shanchan Wu and Yifan He. I tried to replicate the R-BERT model introduced in that paper, adapting it to my binary classification task and to the small amount of data available, in the hope of achieving performance close to what the paper reports. (The figure below was taken from the paper and edited by me so that it represents the architecture I actually implemented. I also changed some of the original notation for convenience, and the code uses the new notation as well.)

[Figure: the implemented architecture, adapted from the paper]

The final parameter settings:

  • Pretrained BERT: bert-base-uncased
  • Number of epochs: 8
  • Batch size: 1 (as the dataset is very small)
  • Activation: Tanh
  • Optimizer: AdamW
  • Learning rate: 2e-5
  • Dropout rate: 0.2
  • Loss function: BCELoss (binary cross-entropy loss)
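To make these settings concrete, here is a hedged sketch of an R-BERT-style classification head with a single sigmoid output, followed by one training step with AdamW and `BCELoss`. It follows the layout described in the Wu and He paper (tanh, dropout, and a linear layer on the [CLS] vector and on each averaged entity span, then concatenation). It is not the code from this repository. The class name `RelationHead`, the mask-based span averaging, the reduced hidden size, and the random tensor standing in for BERT's output are assumptions made so the example runs on its own.

```python
import torch
from torch import nn


class RelationHead(nn.Module):
    """R-BERT-style head with one sigmoid output (binary relation). Illustrative only."""

    def __init__(self, hidden_size: int = 768, dropout: float = 0.2):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.cls_fc = nn.Linear(hidden_size, hidden_size)
        self.ent_fc = nn.Linear(hidden_size, hidden_size)  # shared by both entities
        self.out = nn.Linear(3 * hidden_size, 1)

    @staticmethod
    def span_average(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden); mask: (batch, seq) with 1.0 on the entity's tokens
        mask = mask.unsqueeze(-1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    def forward(self, hidden, e1_mask, e2_mask):
        cls = self.cls_fc(self.dropout(torch.tanh(hidden[:, 0])))
        e1 = self.ent_fc(self.dropout(torch.tanh(self.span_average(hidden, e1_mask))))
        e2 = self.ent_fc(self.dropout(torch.tanh(self.span_average(hidden, e2_mask))))
        logits = self.out(self.dropout(torch.cat([cls, e1, e2], dim=-1)))
        return torch.sigmoid(logits).squeeze(-1)  # probabilities, as BCELoss expects


torch.manual_seed(0)
head = RelationHead(hidden_size=32, dropout=0.2)  # 32 instead of 768 to keep the demo small
optimizer = torch.optim.AdamW(head.parameters(), lr=2e-5)
loss_fn = nn.BCELoss()

# Random stand-in for the encoder's last hidden states: batch size 1, 10 tokens.
hidden = torch.randn(1, 10, 32)
e1_mask = torch.zeros(1, 10)
e1_mask[0, 2:4] = 1.0  # hypothetical token positions of the first entity
e2_mask = torch.zeros(1, 10)
e2_mask[0, 6:8] = 1.0  # hypothetical token positions of the second entity
label = torch.tensor([1.0])  # 1.0 = the relation holds

# One training step.
prob = head(hidden, e1_mask, e2_mask)
loss = loss_fn(prob, label)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In actual fine-tuning, `hidden` would come from the BERT encoder, and the optimizer would also receive whichever encoder parameters are left unfrozen.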

Evaluation results:

[Figure: evaluation results]

Score: 100