Token classification using PhoBERT models for 🇻🇳Vietnamese
Get started in seconds with verified environments. Run the script below to install all dependencies.
bash ./install_dependencies.sh
The input data format of 🍜VPhoBertTagger follows the VLSP-2016 format with four columns separated by a tab character: word, POS, chunk, and named entity. Each (word-segmented) word is placed on a separate line, and there is an empty line after each sentence. For details, see the sample data in the 'datasets/samples' directory. The table below shows an example Vietnamese sentence from the dataset; a minimal parsing sketch follows the table.
Word | POS | Chunk | NER |
---|---|---|---|
Dương | Np | B-NP | B-PER |
là | V | B-VP | O |
một | M | B-NP | O |
chủ | N | B-NP | O |
cửa hàng | N | B-NP | O |
lâu | A | B-AP | O |
năm | N | B-NP | O |
ở | E | B-PP | O |
Hà Nội | Np | B-NP | B-LOC |
. | CH | O | O |
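To make the format concrete, here is a minimal sketch of how such a file could be read into (words, NER tags) pairs. It is illustrative only: the function name and the example path are assumptions, not part of the repository.

```python
from typing import List, Tuple

def read_vlsp2016(path: str) -> List[Tuple[List[str], List[str]]]:
    """Read a tab-separated VLSP-2016 style file into (words, ner_tags) sentences."""
    sentences, words, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # a blank line ends the current sentence
                if words:
                    sentences.append((words, tags))
                    words, tags = [], []
                continue
            word, pos, chunk, ner = line.split("\t")   # four tab-separated columns
            words.append(word)
            tags.append(ner)                  # keep only the NER column here
    if words:                                 # last sentence without a trailing blank line
        sentences.append((words, tags))
    return sentences

# Example (hypothetical path):
# train = read_vlsp2016("datasets/vlsp2016/train.txt")
```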
The dataset must be placed in a directory with the structure below.
├── data_dir
| └── train.txt
| └── dev.txt
| └── test.txt
The commands below fine-tune PhoBERT for the token-classification task. Pre-trained weights are downloaded automatically from the Hugging Face Hub.
python main.py train --task vlsp2016 --run_test --data_dir ./datasets/vlsp2016 --model_name_or_path vinai/phobert-base --model_arch softmax --output_dir outputs --max_seq_length 256 --train_batch_size 32 --eval_batch_size 32 --learning_rate 3e-5 --epochs 20 --early_stop 2 --overwrite_data
or
bash ./train.sh
Arguments:
- `type` (`str`, *required*): The process type to run. Must be one of [`train`, `test`, `predict`, `demo`].
- `task` (`str`, *optional*): Training task selected in the list: [`vlsp2016`, `vlsp2018_l1`, `vlsp2018_l2`, `vlsp2018_join`]. Default: `vlsp2016`.
- `data_dir` (`Union[str, os.PathLike]`, *required*): The input data directory. Should contain the .csv files (or other data files) for the task.
- `overwrite_data` (`bool`, *optional*): Whether to overwrite the split dataset. Default: `False`.
- `load_weights` (`Union[str, os.PathLike]`, *optional*): Path to a pre-trained weights file.
- `model_name_or_path` (`str`, *required*): Pre-trained model selected in the list: [`vinai/phobert-base`, `vinai/phobert-large`, ...]. Default: `vinai/phobert-base`.
- `model_arch` (`str`, *required*): Token-classification model architecture selected in the list: [`softmax`, `crf`, `lstm_crf`].
- `output_dir` (`Union[str, os.PathLike]`, *required*): The output directory where the model predictions and checkpoints will be written.
- `max_seq_length` (`int`, *optional*): The maximum total input sequence length after subword tokenization. Longer sequences will be truncated; shorter sequences will be padded. Default: `190`.
- `train_batch_size` (`int`, *optional*): Total batch size for training. Default: `32`.
- `eval_batch_size` (`int`, *optional*): Total batch size for evaluation. Default: `32`.
- `learning_rate` (`float`, *optional*): The initial learning rate for Adam. Default: `1e-4`.
- `classifier_learning_rate` (`float`, *optional*): The initial learning rate for the classifier head (see the optimizer sketch after this list). Default: `5e-4`.
- `epochs` (`float`, *optional*): Total number of training epochs to perform. Default: `100.0`.
- `weight_decay` (`float`, *optional*): Weight decay to apply, if any. Default: `0.01`.
- `adam_epsilon` (`float`, *optional*): Epsilon for the Adam optimizer. Default: `5e-8`.
- `max_grad_norm` (`float`, *optional*): Max gradient norm. Default: `1.0`.
- `early_stop` (`float`, *optional*): Number of early-stopping steps. Default: `10.0`.
- `no_cuda` (`bool`, *optional*): Whether to disable CUDA even when it is available. Default: `False`.
- `run_test` (`bool`, *optional*): Whether to evaluate the best model on the test set after training. Default: `False`.
- `seed` (`int`, *optional*): Random seed for initialization. Default: `42`.
- `num_workers` (`int`, *optional*): How many subprocesses to use for data loading. `0` means the data will be loaded in the main process. Default: `0`.
- `save_step` (`int`, *optional*): Interval (in steps) at which the model will be saved. Default: `10000`.
- `gradient_accumulation_steps` (`int`, *optional*): Number of update steps to accumulate before performing a backward/update pass. Default: `1`.
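The separate `learning_rate` and `classifier_learning_rate` values suggest two optimizer parameter groups: one for the PhoBERT encoder and one for the classification head. Below is a minimal sketch of how that could be wired up with AdamW; the assumption that head parameters are named `classifier.*` is illustrative, not the repository's actual module layout.

```python
import torch

def build_optimizer(model, learning_rate=1e-4, classifier_learning_rate=5e-4,
                    weight_decay=0.01, adam_epsilon=5e-8):
    """Create AdamW with two parameter groups: encoder vs. classifier head."""
    # Hypothetical split: any parameter whose name starts with "classifier"
    # is treated as part of the classification head.
    head_params = [p for n, p in model.named_parameters() if n.startswith("classifier")]
    encoder_params = [p for n, p in model.named_parameters() if not n.startswith("classifier")]
    return torch.optim.AdamW(
        [
            {"params": encoder_params, "lr": learning_rate},
            {"params": head_params, "lr": classifier_learning_rate},
        ],
        weight_decay=weight_decay,
        eps=adam_epsilon,
    )
```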
The command below starts TensorBoard to help you follow the fine-tuning process.
tensorboard --logdir runs --host 0.0.0.0 --port=6006
All experiments were performed on an RTX 3090 with 24GB of VRAM and a Xeon® E5-2678 v3 CPU with 64GB of RAM, both of which are available for rent on vast.ai. The pre-trained models used for comparison are available on Hugging Face.
Model | Arch | BIO Accuracy | BIO Precision | BIO Recall | BIO F1-score | BIO Accuracy (w/o 'O') | NE Accuracy | NE Precision | NE Recall | NE F1-score | Log
---|---|---|---|---|---|---|---|---|---|---|---
Bert-base-multilingual-cased [1] | Softmax | 0.9905 | 0.9239 | 0.8776 | 0.8984 | 0.9068 | 0.9905 | 0.8938 | 0.8941 | 0.8939 | Matrix / Log
 | CRF | 0.9903 | 0.9241 | 0.8880 | 0.9048 | 0.9087 | 0.9903 | 0.8951 | 0.8945 | 0.8948 | Matrix / Log
 | LSTM_CRF | 0.9905 | 0.9183 | 0.8898 | 0.9027 | 0.9178 | 0.9905 | 0.8879 | 0.8992 | 0.8935 | Matrix / Log
PhoBert-base [2] | Softmax | 0.9950 | 0.9312 | 0.9404 | 0.9348 | 0.9570 | 0.9950 | 0.9434 | 0.9466 | 0.9450 | Matrix / Log
 | CRF | 0.9949 | 0.9497 | 0.9248 | 0.9359 | 0.9525 | 0.9949 | 0.9516 | 0.9456 | 0.9486 | Matrix / Log
 | LSTM_CRF | 0.9949 | 0.9535 | 0.9181 | 0.9349 | 0.9456 | 0.9949 | 0.9520 | 0.9396 | 0.9457 | Matrix / Log
viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

Model | Arch | BIO Accuracy | BIO Precision | BIO Recall | BIO F1-score | BIO Accuracy (w/o 'O') | NE Accuracy | NE Precision | NE Recall | NE F1-score | Epoch
---|---|---|---|---|---|---|---|---|---|---|---
Bert-base-multilingual-cased [1] | Softmax | 0.9828 | 0.7421 | 0.7980 | 0.7671 | 0.8510 | 0.9828 | 0.7302 | 0.8339 | 0.7786 | Matrix / Log
 | CRF | 0.9824 | 0.7716 | 0.7619 | 0.7601 | 0.8284 | 0.9824 | 0.7542 | 0.8127 | 0.7824 | Matrix / Log
 | LSTM_CRF | 0.9829 | 0.7533 | 0.7750 | 0.7626 | 0.8296 | 0.9829 | 0.7612 | 0.8122 | 0.7859 | Matrix / Log
PhoBert-base [2] | Softmax | 0.9896 | 0.7970 | 0.8404 | 0.8170 | 0.8892 | 0.9896 | 0.8421 | 0.8942 | 0.8674 | Matrix / Log
 | CRF | 0.9903 | 0.8124 | 0.8428 | 0.8260 | 0.8834 | 0.9903 | 0.8695 | 0.8943 | 0.8817 | Matrix / Log
 | LSTM_CRF | 0.9901 | 0.8240 | 0.8278 | 0.8241 | 0.8715 | 0.9901 | 0.8671 | 0.8773 | 0.8721 | Matrix / Log
viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

Model | Arch | BIO Accuracy | BIO Precision | BIO Recall | BIO F1-score | BIO Accuracy (w/o 'O') | NE Accuracy | NE Precision | NE Recall | NE F1-score | Epoch
---|---|---|---|---|---|---|---|---|---|---|---
Bert-base-multilingual-cased [1] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
PhoBert-base [2] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

Model | Arch | BIO Accuracy | BIO Precision | BIO Recall | BIO F1-score | BIO Accuracy (w/o 'O') | NE Accuracy | NE Precision | NE Recall | NE F1-score | Epoch
---|---|---|---|---|---|---|---|---|---|---|---
Bert-base-multilingual-cased [1] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
PhoBert-base [2] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
viBERT [3] | Softmax | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
 | LSTM_CRF | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

[1] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT (pp. 4171-4186).
[2] Nguyen, D. Q., & Nguyen, A. T. (2020). PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 1037-1042).
[3] The, V. B., Thi, O. T., & Le-Hong, P. (2020). Improving sequence tagging for Vietnamese text using transformer-based neural models. arXiv preprint arXiv:2006.15994.
The command below loads your fine-tuned model and runs inference on your text input.
python main.py predict --model_path outputs/best_model.pt
Arguments:
- `type` (`str`, *required*): The process type to run. Must be one of [`train`, `test`, `predict`, `demo`].
- `model_path` (`Union[str, os.PathLike]`, *optional*): Path to the fine-tuned model file.
- `no_cuda` (`bool`, *optional*): Whether to disable CUDA even when it is available. Default: `False`.
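For reference, the sketch below shows one generic way to run token-classification inference with the Hugging Face `transformers` API; it does not reproduce the repository's predict code. It assumes a model exported with `save_pretrained` to a directory (the repository itself saves a single `.pt` checkpoint, so adapt the loading step), and it assumes the input text is already word-segmented as PhoBERT expects.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Hypothetical export directory; not the repository's outputs/best_model.pt checkpoint.
model_dir = "outputs/exported_model"
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForTokenClassification.from_pretrained(model_dir)

ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")

# PhoBERT expects word-segmented input (multi-syllable words joined with
# underscores, e.g. via VnCoreNLP/RDRSegmenter).
text = "Dương là một chủ cửa_hàng lâu năm ở Hà_Nội ."
print(ner(text))
```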
The command below loads your fine-tuned model and starts the demo page.
python main.py demo --model_path outputs/best_model.pt
Arguments:
- `type` (`str`, *required*): The process type to run. Must be one of [`train`, `test`, `predict`, `demo`].
- `model_path` (`Union[str, os.PathLike]`, *optional*): Path to the fine-tuned model file.
- `no_cuda` (`bool`, *optional*): Whether to disable CUDA even when it is available. Default: `False`.
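If you want to put together a quick web demo of your own (separate from the repository's built-in demo page), a minimal Gradio wrapper around the inference sketch above could look like this. The model directory, the `tag_text` helper, and the choice of Gradio are all assumptions for illustration.

```python
import gradio as gr
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Hypothetical loading, mirroring the inference sketch above.
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForTokenClassification.from_pretrained("outputs/exported_model")
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")

def tag_text(text: str):
    """Return predicted entities for a word-segmented Vietnamese sentence."""
    return ner(text)

# A simple text-in / JSON-out demo page, served locally by Gradio.
gr.Interface(fn=tag_text, inputs="text", outputs="json",
             title="Vietnamese NER demo").launch()
```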
Pre-trained PhoBERT models by VinAI Research and the PyTorch implementation by Hugging Face.