refer to data/collate_data.ipynb, utils/train_tokenizer.py
refer to data/collate_data.ipynb
- Masked Language Modeling: 15% of the total tokens are selected (80%: masked, 10%: replaced with a random token, 10%: left as the original)
- Next Sentence Prediction: 50% (0: not next sentence, 1: next sentence)
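The two objectives above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the names mask_tokens, make_nsp_pair and MASK_TOKEN are hypothetical, each token is selected independently (so 15% holds only in expectation), special tokens are not treated separately, and the NSP negative is drawn from the same sentence list rather than from another document.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed name of the mask token


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels[i] is None where no prediction is made."""
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


def make_nsp_pair(sentences, index, rng=None):
    """Return (sentence_a, sentence_b, label); 1 = next sentence, 0 = not next sentence.

    Assumes len(sentences) >= 3 and index < len(sentences) - 1.
    """
    if rng is None:
        rng = random.Random()
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        return sentence_a, sentences[index + 1], 1  # true next sentence
    # Negative example: any sentence except sentence_a and its true successor.
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), 0
```

Passing a seeded random.Random instance as rng makes the sampling reproducible, which helps when comparing runs.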
python run_pretraining.py --config_path config.json --continuous --checkpoint results/1000-step
- --config_path: config file (default: './config.json')
- --continuous: boolean flag for continuing training from a checkpoint
- --checkpoint: path of the checkpoint to resume from when continuing training
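For orientation, the three options above could be declared with argparse roughly as follows. This is a sketch under assumptions, not the contents of run_pretraining.py: build_parser and parse_args are hypothetical helpers, treating --continuous as a store_true flag is a guess from its description as a boolean, and the check that --continuous needs --checkpoint is an added assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Pre-training options (sketch)")
    parser.add_argument("--config_path", default="./config.json",
                        help="config file")
    parser.add_argument("--continuous", action="store_true",
                        help="continue training from a checkpoint")
    parser.add_argument("--checkpoint", default=None,
                        help="path of the checkpoint to resume from")
    return parser


def parse_args(argv=None):
    """Parse argv (or sys.argv when None) and apply one assumed sanity check."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: resuming makes no sense without a checkpoint path.
    if args.continuous and args.checkpoint is None:
        parser.error("--continuous requires --checkpoint")
    return args


if __name__ == "__main__":
    print(parse_args())
```

argparse accepts unambiguous option prefixes by default, so under this sketch --cont would resolve to --continuous, whereas a bare --c would match all three options and be rejected as ambiguous.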
Evaluate: refer to evaluate_script.py