- PERT: Pre-Training BERT with Permuted Language Model
- Yiming Cui, Ziqing Yang, Ting Liu
Chinese LERT | Chinese/English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | TextBrewer | TextPruner
View more resources released by HFL: https://github.com/ymcui/HFL-Anthology
Mar 28, 2023 We open-sourced Chinese LLaMA & Alpaca LLMs, which can be quickly deployed on a PC. Check: https://github.com/ymcui/Chinese-LLaMA-Alpaca
Oct 29, 2022 We released a new pre-trained model called LERT. Check: https://github.com/ymcui/LERT/
May 17, 2022 We released PERT models fine-tuned on machine reading comprehension (MRC) data, with interactive demos. Check: Download
Mar 15, 2022 Our preliminary technical report is available on arXiv: https://arxiv.org/abs/2203.06906
Feb 24, 2022 Chinese and English PERT-base and PERT-large have been released. They can be loaded directly with the BERT structure and fine-tuned for downstream tasks. The technical report will be released once it is finalized, which is expected in mid-March. Thank you for your patience.
Feb 17, 2022 Thank you for your interest in this project. The models are expected to be released next week, and the technical report will follow once it is finalized.
Chapter | Description |
---|---|
Introduction | The basic principle of PERT |
Download | Download pre-trained PERT |
QuickLoad | How to use 🤗Transformers to quickly load models |
Baseline Performance | Baseline system performances on some NLU tasks |
FAQ | Frequently Asked Questions |
Citation | Technical report of this project |
Pre-training approaches for natural language understanding (NLU) fall broadly into two categories, depending on whether the input text contains the masking token [MASK].
This work is motivated by a common observation: text that is permuted to a certain degree remains comprehensible. Is it then possible to learn semantic knowledge from permuted text?
General idea: PERT utilizes permuted text as the input (so no [MASK] tokens are introduced). The learning objective of PERT is to predict the location of the original token. Please take a look at the following example.
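As a rough illustration of the idea above, the following sketch builds a permuted input and position labels from a token list. It is a simplified toy, not the authors' pre-training code: the function name `permute_tokens`, the `ratio` and `seed` parameters, and the labeling convention (for each selected original position, the index where its token ends up) are assumptions made for this example. See the technical report for the actual procedure.

```python
import random


def permute_tokens(tokens, ratio=0.15, seed=0):
    """Shuffle a random subset of positions; return (permuted_tokens, labels).

    labels[i] is the index in the permuted sequence that now holds the token
    originally at position i, or -1 if position i was not selected.
    The selection ratio and label convention are assumptions of this sketch.
    """
    rng = random.Random(seed)
    n = len(tokens)
    # Select at least two positions (when available) so a swap is possible.
    k = min(n, max(2, round(n * ratio)))
    selected = sorted(rng.sample(range(n), k))
    shuffled = selected[:]
    rng.shuffle(shuffled)

    permuted = list(tokens)
    labels = [-1] * n
    for src, dst in zip(selected, shuffled):
        permuted[dst] = tokens[src]  # the token from src moves to dst
        labels[src] = dst
    return permuted, labels


tokens = "a certain degree of permuted text does not affect comprehension".split()
permuted, labels = permute_tokens(tokens, ratio=0.4, seed=1)
print(permuted)
print(labels)
```

Note that the permuted sequence contains exactly the original tokens and no [MASK] token, so the input seen during pre-training looks like ordinary (if scrambled) text.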
This section mainly provides model weights for TensorFlow 1.15. For PyTorch or TensorFlow 2 models, see the next section.
The open-source version contains only the weights of the Transformer part, which can be used directly for fine-tuning on downstream tasks. You can also further pre-train the model with any pre-training objective, as long as it uses a traditional Transformer architecture as its main body. For more details, see the FAQ.
- PERT-large: 24-layer, 1024-hidden, 16-heads, 330M parameters
- PERT-base: 12-layer, 768-hidden, 12-heads, 110M parameters
Model | Language | Corpus | Google Download | Baidu Disk Download |
---|---|---|---|---|
Chinese-PERT-large | Chinese | EXT data [1] | TensorFlow | TensorFlow (password: e9hs) |
Chinese-PERT-base | Chinese | EXT data [1] | TensorFlow | TensorFlow (password: rcsw) |
English-PERT-large (uncased) | English | WikiBooks[2] | TensorFlow | TensorFlow (password: wxwi) |
English-PERT-base (uncased) | English | WikiBooks[2] | TensorFlow | TensorFlow (password: 8jgq) |
[1] EXT data includes Chinese Wikipedia, encyclopedias, news, question-answering websites, etc. It totals 5.4B words and takes about 20G of disk space, the same as for MacBERT.
[2] Wikipedia + BookCorpus
Take the TensorFlow version of `Chinese-PERT-base` as an example. The zip archive contains the following files:
```
chinese_pert_base_L-12_H-768_A-12.zip
    |- pert_model.ckpt      # model weights
    |- pert_model.meta      # model meta information
    |- pert_model.index     # model index information
    |- pert_config.json     # model parameters
    |- vocab.txt            # vocabulary (same as the original vocabulary of Google's BERT-base-Chinese)
```
Among them, `pert_config.json` and `vocab.txt` are exactly the same as those of Google's original `BERT-base, Chinese` (for the English models, they are the same as in the BERT-uncased version).
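After unpacking an archive, a quick sanity check is to compare the config file against the layer, hidden size, and head counts listed above. The sketch below assumes the config uses the key names of Google's BERT config format (`num_hidden_layers`, `hidden_size`, `num_attention_heads`), which follows from the statement that the file matches Google's original. The helper name `check_config` and any path you pass to it are hypothetical.

```python
import json
from pathlib import Path

# Shapes advertised for the two model sizes in this README.
EXPECTED_SHAPES = {
    "base": {"num_hidden_layers": 12, "hidden_size": 768, "num_attention_heads": 12},
    "large": {"num_hidden_layers": 24, "hidden_size": 1024, "num_attention_heads": 16},
}


def check_config(config_path, size="base"):
    """Return {key: (found, expected)} for every mismatch; empty if all match.

    Assumes BERT-style key names in the JSON config (an assumption of this sketch).
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    expected = EXPECTED_SHAPES[size]
    return {
        key: (config.get(key), want)
        for key, want in expected.items()
        if config.get(key) != want
    }
```

For example, `check_config("chinese_pert_base_L-12_H-768_A-12/pert_config.json", "base")` would return an empty dict if the unpacked config matches the base shape (the directory name here is only a guess at where you unpacked the archive).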
TensorFlow (v2) and PyTorch version models can be downloaded through the 🤗transformers model library.
Download method: Click on any model to be downloaded → select the "Files and versions" tab → download the corresponding model file.
Model | Model File Size | Transformers ModelHub URL |
---|---|---|
Chinese-PERT-large | 1.2G | https://huggingface.co/hfl/chinese-pert-large |
Chinese-PERT-base | 0.4G | https://huggingface.co/hfl/chinese-pert-base |
Chinese-PERT-large-MRC | 1.2G | https://huggingface.co/hfl/chinese-pert-large-mrc |
Chinese-PERT-base-MRC | 0.4G | https://huggingface.co/hfl/chinese-pert-base-mrc |
English-PERT-large | 1.2G | https://huggingface.co/hfl/english-pert-large |
English-PERT-base | 0.4G | https://huggingface.co/hfl/english-pert-base |
Since the main body of PERT is still the same as the BERT structure, users can easily call the PERT model using the transformers library.
**Note: All PERT models in this project should be loaded with BertTokenizer and BertModel (BertForQuestionAnswering for MRC models).**
```python
from transformers import BertTokenizer, BertModel

# Replace "MODEL_NAME" with one of the names listed in the table below.
tokenizer = BertTokenizer.from_pretrained("MODEL_NAME")
model = BertModel.from_pretrained("MODEL_NAME")
```
The list of `MODEL_NAME` values is as follows:
Model name | MODEL_NAME |
---|---|
Chinese-PERT-large | hfl/chinese-pert-large |
Chinese-PERT-base | hfl/chinese-pert-base |
Chinese-PERT-large-MRC | hfl/chinese-pert-large-mrc |
Chinese-PERT-base-MRC | hfl/chinese-pert-base-mrc |
English-PERT-large | hfl/english-pert-large |
English-PERT-base | hfl/english-pert-base |
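The loading rule from the note above (BertModel for plain checkpoints, BertForQuestionAnswering for MRC checkpoints) can be sketched as a small helper. The function names and the `-mrc` suffix check are assumptions based on the names in the table, not part of the project. The last function also assumes PyTorch is installed, since it requests `"pt"` tensors.

```python
def model_class_name(model_name):
    """Pick the transformers class for a MODEL_NAME from the table above.

    The "-mrc" suffix test is a heuristic derived from the listed names.
    """
    return "BertForQuestionAnswering" if model_name.endswith("-mrc") else "BertModel"


def load_pert(model_name):
    """Download (or read from cache) the tokenizer and model for MODEL_NAME."""
    import transformers  # imported lazily so the helper above works without it

    tokenizer = transformers.BertTokenizer.from_pretrained(model_name)
    model_cls = getattr(transformers, model_class_name(model_name))
    model = model_cls.from_pretrained(model_name)
    return tokenizer, model


def hidden_state_shape(text, model_name="hfl/english-pert-base"):
    """Encode one sentence with a non-MRC model and return the output shape."""
    tokenizer, model = load_pert(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    # Shape is (batch_size, sequence_length, hidden_size).
    return tuple(outputs.last_hidden_state.shape)
```

Calling `load_pert("hfl/chinese-pert-base-mrc")` would return a question-answering model whose outputs expose `start_logits` and `end_logits` instead of `last_hidden_state`.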
For detailed performance, please see: https://arxiv.org/abs/2203.06906
We report both average score (in brackets) and maximum.
We perform experiments on the following ten Chinese tasks.
- Machine Reading Comprehension (2): CMRC 2018 (Simplified Chinese), DRCD (Traditional Chinese)
- Text Classification (6):
- Named Entity Recognition (NER) (2): MSRA-NER, People's Daily (人民日报)
In addition, we carried out experiments on the word order recovery task, which is a subtask of text correction.
We perform experiments on the following six English tasks.
- Machine Reading Comprehension (2): SQuAD 1.1, SQuAD 2.0
- GLUE Tasks (4): MNLI, SST-2, CoLA, MRPC
Q1: About the open-source version of PERT
A1: The open-source version contains only the weights of the Transformer part, which can be used directly for fine-tuning on downstream tasks or to initialize further pre-training of other models. The original TF weights may contain randomly initialized MLM weights (please do not use them). There are two reasons for this:
- To remove unnecessary Adam-related weights (the model shrinks to about 1/3 of its original size);
- To stay consistent with the BERT model conversion in transformers (this process uses the original BERT structure, so the weights of the pre-training task part are lost and BERT's randomly initialized MLM weights are retained).
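The 1/3 figure follows from how Adam works: it keeps two moment tensors of the same shape for every trainable weight, so a training checkpoint stores roughly three tensors per parameter. The sketch below filters such optimizer slots out of a list of variable names. The `/adam_m` and `/adam_v` suffixes follow the naming used in Google's BERT training code; whether the PERT checkpoints used exactly these names is an assumption, and the variable names listed are for illustration only.

```python
def strip_optimizer_slots(variable_names):
    """Keep model variables; drop Adam moment slots and the global step."""
    dropped_suffixes = ("/adam_m", "/adam_v")
    return [
        name
        for name in variable_names
        if not name.endswith(dropped_suffixes) and name != "global_step"
    ]


# Illustrative variable names only, not read from a real checkpoint.
names = [
    "bert/embeddings/word_embeddings",
    "bert/embeddings/word_embeddings/adam_m",
    "bert/embeddings/word_embeddings/adam_v",
    "bert/encoder/layer_0/attention/self/query/kernel",
    "bert/encoder/layer_0/attention/self/query/kernel/adam_m",
    "bert/encoder/layer_0/attention/self/query/kernel/adam_v",
    "global_step",
]
kept = strip_optimizer_slots(names)
print(len(kept), "of", len(names), "variables kept")
```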
Q2: About the effect of PERT on downstream tasks
A2: Our preliminary conclusion is that PERT performs better on tasks such as reading comprehension and sequence labeling, but worse on text classification tasks. Please evaluate it on your own tasks. For more information, please read our paper: https://arxiv.org/abs/2203.06906
Please cite our paper if you find the resource or model useful. https://arxiv.org/abs/2203.06906
```
@article{cui2022pert,
      title={PERT: Pre-training BERT with Permuted Language Model},
      author={Cui, Yiming and Yang, Ziqing and Liu, Ting},
      year={2022},
      eprint={2203.06906},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Follow our official WeChat account to keep updated with our latest technologies!
If you have questions, please submit them in a GitHub Issue.
- Please read the FAQ before you submit an issue.
- Repetitive and irrelevant issues will be ignored and closed by stale-bot (stale · GitHub Marketplace). Thank you for your understanding and support.
- We cannot accommodate EVERY request, so please bear in mind that there is no guarantee that your request will be met.
- Always be polite when you submit an issue.