- Deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers
- Pre-trained BERT can be fine-tuned with just one additional output layer for a wide range of tasks (a classification-head sketch follows this list)
- Feature-based vs fine-tuning strategies for using pre-trained language representations
- feature-based: ELMo (pre-trained representations are used as additional input features)
- fine-tuning-based: GPT (all pre-trained parameters are updated on the downstream task)
- This paper improves on the fine-tuning-based approach with a masked language model (MLM) objective: some input tokens are randomly masked, and the model must predict the original vocabulary id of each masked word from its bidirectional context alone (see the masking sketch below)
- Also demonstrates the importance of jointly pre-training text-pair representations via a next-sentence prediction (NSP) task (see the pair-construction sketch below)
- Unsupervised feature-based approaches: ELMo
- Unsupervised fine-tuning-based approaches: GPT
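
A minimal sketch of the one-extra-layer fine-tuning setup, in plain PyTorch with a toy stand-in encoder (class and variable names here are illustrative, not from the paper): a single linear classifier sits on the final hidden state of the first ([CLS]) token, and during fine-tuning gradients flow through both the head and the pre-trained encoder.

```python
import torch
import torch.nn as nn

class BertForClassification(nn.Module):
    """A pre-trained encoder plus ONE task-specific linear layer;
    the whole stack (encoder + head) is updated during fine-tuning."""
    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                     # pre-trained BERT body
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)           # (batch, seq, hidden)
        cls_vec = hidden[:, 0]                     # [CLS] token representation
        return self.classifier(cls_vec)            # task logits

# Toy stand-in encoder so the sketch runs without downloading weights.
toy_encoder = nn.Sequential(nn.Embedding(30522, 768))
model = BertForClassification(toy_encoder)
logits = model(torch.randint(0, 30522, (1, 8)))
print(logits.shape)  # torch.Size([1, 2])
```

By contrast, a feature-based (ELMo-style) setup would freeze the encoder (`p.requires_grad_(False)` on its parameters) and train only task-specific layers on top of its outputs.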
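
A sketch of the MLM masking step, assuming the paper's 15% selection rate and its 80/10/10 replacement rule (80% [MASK], 10% random token, 10% left unchanged); the function name and toy vocabulary are mine:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly select ~15% of positions; the model must recover the
    original token at each selected position (loss is computed only on
    those positions). Applies the 80/10/10 replacement rule."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # prediction target
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens("the cat sat on the mat".split()))
```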
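
And a sketch of how NSP training pairs could be built, assuming the paper's 50/50 split between actual next sentences (IsNext) and random sentences from other documents (NotNext); the helper name and toy corpus are illustrative:

```python
import random

def make_nsp_example(doc, other_docs):
    """Build one (sentence_a, sentence_b, is_next) example: 50% of the
    time sentence B actually follows A in the same document, 50% of the
    time it is a random sentence from a different document."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], True            # IsNext
    other = random.choice(other_docs)
    return sent_a, random.choice(other), False     # NotNext

doc1 = ["the cat sat down", "then it slept"]
doc2 = ["stocks fell sharply", "markets recovered later"]
print(make_nsp_example(doc1, [doc2]))
```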