Pretraining BERT (DistilBERT) with PyTorch and Huggingface

Step 1. Collect data to train the tokenizer and BERT

Refer to data/collate_data.ipynb and utils/train_tokenizer.py.
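
As a point of reference, here is a minimal sketch of WordPiece tokenizer training with the Hugging Face tokenizers library, along the lines of what utils/train_tokenizer.py does. The corpus path, vocabulary size, and output directory are assumptions for illustration, not values taken from this repo.

```python
from tokenizers import BertWordPieceTokenizer

# Minimal sketch, assuming a plain-text corpus with one sentence per line.
# "data/corpus.txt", vocab_size, and the output directory are hypothetical.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["data/corpus.txt"],
    vocab_size=30522,               # BERT-base default vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("tokenizer")   # writes tokenizer/vocab.txt
```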

Step 2. Build the pretraining data

Refer to data/collate_data.ipynb.

  • Masked Language Modeling

    • 15% of all tokens are selected for prediction (80% replaced with [MASK], 10% replaced with a random token, 10% kept as the original); see the masking sketch after this list
  • Next Sentence Prediction

    • 50% of sentence pairs are consecutive (1 : next sentence, 0 : not next sentence)
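
A minimal sketch of the 80/10/10 masking rule described above, following the logic of Hugging Face's DataCollatorForLanguageModeling; the function and variable names are illustrative, not taken from this repo's notebook.

```python
import torch

def mask_tokens(input_ids, tokenizer, mlm_prob=0.15):
    """BERT-style masking: select 15% of tokens; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    labels = input_ids.clone()

    # Choose positions to predict, never selecting special tokens.
    prob = torch.full(labels.shape, mlm_prob)
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(row.tolist(), already_has_special_tokens=True)
         for row in labels], dtype=torch.bool)
    prob.masked_fill_(special, 0.0)
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100  # loss is computed only on selected positions

    # 80% of selected positions become [MASK].
    masked = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[masked] = tokenizer.mask_token_id

    # Half of the rest (10% overall) become a random token; the final 10% stay as-is.
    random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~masked
    input_ids[random_tok] = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)[random_tok]
    return input_ids, labels
```

For NSP, each example pairs a sentence with either its true successor (label 1) or a random sentence from the corpus (label 0), split evenly as described above.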

Step 3. Training

python run_pretraining.py --c config.json --cont --checkpoint results/1000-step

  • --config_path : path to the config file (default : './config.json')

  • --continuous : flag to resume training from a checkpoint

  • --checkpoint : path to the checkpoint used when resuming training
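
A hedged sketch of how run_pretraining.py might wire these flags together with transformers; the short aliases (--c, --cont) and the checkpoint file layout are assumptions inferred from the example command above, not verified against this repo.

```python
import argparse
import torch
from transformers import BertConfig, BertForPreTraining

# Sketch only: flag names follow the README; the aliases and the
# checkpoint filename are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument("--config_path", "--c", default="./config.json")
parser.add_argument("--continuous", "--cont", action="store_true")
parser.add_argument("--checkpoint", default=None)
args = parser.parse_args()

config = BertConfig.from_json_file(args.config_path)
model = BertForPreTraining(config)  # BERT with both MLM and NSP heads

if args.continuous and args.checkpoint:
    # Resume from a saved state dict (assumed filename).
    state = torch.load(f"{args.checkpoint}/pytorch_model.bin", map_location="cpu")
    model.load_state_dict(state)

# During training, both objectives are optimized jointly:
# outputs = model(input_ids=..., attention_mask=..., token_type_ids=...,
#                 labels=mlm_labels, next_sentence_label=nsp_labels)
# outputs.loss is the sum of the MLM and NSP losses.
```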

Step 4. Evaluate

For evaluation, refer to evaluate_script.py.
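
As a quick sanity check independent of evaluate_script.py, a fill-mask pipeline can probe the pretrained MLM head. The checkpoint directory below is an assumption; this works only if it contains the model config, weights, and tokenizer files.

```python
from transformers import pipeline

# Hypothetical checkpoint path from the training example above.
fill = pipeline("fill-mask", model="results/1000-step")
for pred in fill("The capital of France is [MASK]."):
    print(f"{pred['token_str']}: {pred['score']:.3f}")
```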
