BERT.md

Key ideas

  • Deep bi-directional representations from unlabeled text by jointly conditioning on both left and right context in all layers
  • Pre-trained BERT can be fine-tuned with just one additional output layer for a wide range of tasks (see the sketch after this list)
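A minimal sketch of that second point, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is part of the paper itself): a single linear layer on top of the [CLS] representation turns the pre-trained encoder into a sentence classifier.

```python
# Sketch only: fine-tune BERT for sentence classification by adding one
# linear output layer on top of the [CLS] token representation.
# Library, checkpoint, and label count are assumptions, not from the paper.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # 2 labels, illustrative

inputs = tokenizer("BERT is straightforward to fine-tune.", return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden)
logits = classifier(hidden[:, 0])             # [CLS] vector -> class scores
```

During fine-tuning, the encoder and the new output layer are trained end to end on the labeled task data.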

Introduction

  • Feature-based vs fine-tuning strategies for using pre-trained language representations
    • feature-based: ELMo
    • fine-tuning-based: GPT
  • This paper improves the fine-tuning-based approach by randomly masking some of the input tokens; the objective is to predict the original vocabulary id of each masked word based only on its context (see the masking sketch after this list)
  • Demonstrates the importance of using both next-sentence and previous-sentence context
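A minimal sketch of the masking step in plain Python; the 15% masking rate and the 80/10/10 replacement split come from the paper's pre-training setup, while the function and variable names here are illustrative.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: of the ~15% of tokens selected, 80% become [MASK],
    10% a random token, 10% stay unchanged. Returns the corrupted sequence and
    the prediction targets (None = token not included in the MLM loss)."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                 # model must recover the original token
            r = random.random()
            if r < 0.8:
                masked.append(MASK_TOKEN)       # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)              # 10%: keep original, still predicted
        else:
            masked.append(tok)
            targets.append(None)                # not part of the MLM loss
    return masked, targets

tokens = "my dog is hairy".split()
vocab = ["my", "dog", "is", "hairy", "cat", "runs"]
print(mask_tokens(tokens, vocab))
```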

Related work

  • Unsupervised feature-based approaches, e.g. ELMo
  • Unsupervised fine-tuning-based approaches, e.g. GPT

BERT

  • Two steps (sketched below):
    • pre-training: training on unlabeled data over different pre-training tasks
    • fine-tuning: initializing with the pre-trained parameters and training on labeled data from the downstream tasks
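A schematic sketch of the two-step recipe; the data iterators and the model's pretraining_loss / encode methods are hypothetical placeholders, not a real API.

```python
# Schematic only: the two training phases. `pretraining_batches`,
# `downstream_batches`, `model.pretraining_loss`, and `model.encode`
# are hypothetical stand-ins for real data pipelines and model APIs.
import torch


def pretrain(model, pretraining_batches, optimizer):
    """Step 1: unsupervised pre-training on unlabeled text."""
    for batch in pretraining_batches:
        loss = model.pretraining_loss(batch)  # e.g. masked-LM objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def finetune(model, head, downstream_batches, optimizer):
    """Step 2: supervised fine-tuning; the pre-trained weights initialize the
    encoder and all parameters are updated on the downstream labels."""
    for inputs, labels in downstream_batches:
        logits = head(model.encode(inputs))
        loss = torch.nn.functional.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```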