
KoBERT


Korean BERT pre-trained cased (KoBERT)

Why'?'

Training Environment

  • Architecture
    predefined_args = {
        'attention_cell': 'multi_head',
        'num_layers': 12,
        'units': 768,
        'hidden_size': 3072,
        'max_length': 512,
        'num_heads': 12,
        'scaled': True,
        'dropout': 0.1,
        'use_residual': True,
        'embed_size': 768,
        'embed_dropout': 0.1,
        'token_type_vocab_size': 2,
        'word_embed': None,
    }
  • Training set

| Data | Sentences | Words |
|---|---|---|
| Korean Wikipedia | 5M | 54M |

  • Training environment
    • V100 GPU x 32, Horovod (with InfiniBand)

2019-04-29 TensorBoard log

  • Vocabulary
    • Size: 8,002
    • SentencePiece tokenizer trained on Korean Wikipedia
    • Fewer parameters (92M < 110M)
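
The 92M versus 110M comparison can be reproduced from the architecture settings listed above. The sketch below assumes the standard BERT-base layout (embeddings with layer normalization, 12 encoder layers, and a pooler, without pre-training heads), which this README does not spell out, and uses 30,522 as the vocabulary size of the original English BERT-base for comparison. `bert_param_count` is a hypothetical helper for illustration, not part of the kobert package.

```python
def bert_param_count(vocab_size, units=768, hidden_size=3072, num_layers=12,
                     max_length=512, token_type_vocab_size=2):
    """Estimate the parameter count of a BERT encoder (assumed standard layout)."""
    layer_norm = 2 * units  # scale and shift vectors
    # Word, position, and token-type embedding tables plus one layer norm.
    embeddings = (vocab_size + max_length + token_type_vocab_size) * units + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (units * units + units)
    # Two dense layers of the position-wise feed-forward block.
    feed_forward = (units * hidden_size + hidden_size) + (hidden_size * units + units)
    encoder_layer = attention + feed_forward + 2 * layer_norm
    # Dense layer applied to the first token's output.
    pooler = units * units + units
    return embeddings + num_layers * encoder_layer + pooler


print(bert_param_count(8002))   # 92186880  (about 92M, KoBERT vocabulary)
print(bert_param_count(30522))  # 109482240 (about 110M, English BERT-base vocabulary)
```

Under these assumptions the encoder layers are identical in both cases, so the whole difference comes from the smaller word-embedding table.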

Requirements

How to install

  • Install KoBERT as a python package

    pip install git+https://git@github.com/SKTBrain/KoBERT.git@master
  • If you want to modify source codes, please clone this repository

    git clone https://github.com/SKTBrain/KoBERT.git
    cd KoBERT
    pip install -r requirements.txt

How to use

PyTorch

If you prefer the Huggingface transformers API, see here.

>>> import torch
>>> from kobert import get_pytorch_kobert_model
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
>>> model, vocab  = get_pytorch_kobert_model()
>>> sequence_output, pooled_output = model(input_ids, input_mask, token_type_ids)
>>> pooled_output.shape
torch.Size([2, 768])
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)

The model is returned in eval() mode by default, so call model.train() to switch to training mode before fine-tuning.
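
The `input_ids` and `input_mask` tensors in the example above have a fixed width, with the shorter second sequence padded and its padding position masked out. A minimal sketch of building such a batch from variable-length id lists follows. `pad_batch` is a hypothetical helper, and `pad_id=0` only mirrors the toy example; the real padding index is an assumption here and should be looked up in the vocabulary.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to a common length and build the matching mask.

    pad_id=0 mirrors the toy example; it is an assumption, not the
    confirmed KoBERT padding index.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, input_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        input_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, input_mask


ids, mask = pad_batch([[31, 51, 99], [15, 5]])
print(ids)   # [[31, 51, 99], [15, 5, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The resulting lists can be wrapped in `torch.LongTensor` exactly as in the example.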

  • Naver Sentiment Analysis Fine-Tuning with pytorch
    • In Colab, we recommend enabling a GPU hardware accelerator via [Runtime] - [Change runtime type].
    • Open In Colab

ONNX

>>> import onnxruntime
>>> import numpy as np
>>> from kobert import get_onnx_kobert_model
>>> onnx_path = get_onnx_kobert_model()
>>> sess = onnxruntime.InferenceSession(onnx_path)
>>> input_ids = [[31, 51, 99], [15, 5, 0]]
>>> input_mask = [[1, 1, 1], [1, 1, 0]]
>>> token_type_ids = [[0, 0, 1], [0, 1, 0]]
>>> len_seq = len(input_ids[0])
>>> pred_onnx = sess.run(None, {'input_ids':np.array(input_ids),
...                             'token_type_ids':np.array(token_type_ids),
...                             'input_mask':np.array(input_mask),
...                             'position_ids':np.array(range(len_seq))})
>>> # Last Encoding Layer
>>> pred_onnx[-2][0]
array([[-0.24610452,  0.24282141,  0.25895312, ..., -0.48613444,
        -0.07305173,  0.07560554],
       [-0.24783179,  0.24200465,  0.25520486, ..., -0.4877185 ,
        -0.0727044 ,  0.07536091],
       [-0.24721591,  0.24196623,  0.2560626 , ..., -0.48743123,
        -0.07326943,  0.07650235]], dtype=float32)

soeque1 helped with the ONNX conversion.

MXNet-Gluon

>>> import mxnet as mx
>>> from kobert import get_mxnet_kobert_model
>>> input_id = mx.nd.array([[31, 51, 99], [15, 5, 0]])
>>> input_mask = mx.nd.array([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = mx.nd.array([[0, 0, 1], [0, 1, 0]])
>>> model, vocab = get_mxnet_kobert_model(use_decoder=False, use_classifier=False)
>>> encoder_layer, pooled_output = model(input_id, token_type_ids)
>>> pooled_output.shape
(2, 768)
>>> vocab
Vocab(size=8002, unk="[UNK]", reserved="['[MASK]', '[SEP]', '[CLS]']")
>>> # Last Encoding Layer
>>> encoder_layer[0]
[[-0.24610372  0.24282135  0.2589539  ... -0.48613444 -0.07305248
   0.07560539]
 [-0.24783105  0.242005    0.25520545 ... -0.48771808 -0.07270523
   0.07536077]
 [-0.24721491  0.241966    0.25606337 ... -0.48743105 -0.07327032
   0.07650219]]
<NDArray 3x768 @cpu(0)>
  • Naver Sentiment Analysis Fine-Tuning with MXNet
    • Open In Colab

Tokenizer

>>> from gluonnlp.data import SentencepieceTokenizer
>>> from kobert import get_tokenizer
>>> tok_path = get_tokenizer()
>>> sp  = SentencepieceTokenizer(tok_path)
>>> sp('한국어 모델을 공유합니다.')
['▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.']
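
SentencePiece marks the start of each whitespace-delimited word with the `▁` character (U+2581), so the pieces can be joined back into text by concatenating them and turning that marker into a space. A minimal sketch, where `detokenize` is a hypothetical helper and not part of the kobert package:

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker SentencePiece puts before each word


def detokenize(pieces):
    """Join SentencePiece pieces back into a plain string."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


pieces = ["\u2581한국", "어", "\u2581모델", "을", "\u2581공유", "합니다", "."]
print(detokenize(pieces))  # 한국어 모델을 공유합니다.
```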

Task Fine-tuning

Naver Sentiment Analysis

| Model | Accuracy |
|---|---|
| BERT base multilingual cased | 0.875 |
| KoBERT | 0.901 |
| KoGPT2 | 0.899 |

Korean Named Entity Recognizer built with KoBERT and CRF

๋ฌธ์žฅ์„ ์ž…๋ ฅํ•˜์„ธ์š”:  SKTBrain์—์„œ KoBERT ๋ชจ๋ธ์„ ๊ณต๊ฐœํ•ด์ค€ ๋•๋ถ„์— BERT-CRF ๊ธฐ๋ฐ˜ ๊ฐ์ฒด๋ช…์ธ์‹๊ธฐ๋ฅผ ์‰ฝ๊ฒŒ ๊ฐœ๋ฐœํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.
len: 40, input_token:['[CLS]', 'โ–SK', 'T', 'B', 'ra', 'in', '์—์„œ', 'โ–K', 'o', 'B', 'ER', 'T', 'โ–๋ชจ๋ธ', '์„', 'โ–๊ณต๊ฐœ', 'ํ•ด', '์ค€', 'โ–๋•๋ถ„์—', 'โ–B', 'ER', 'T', '-', 'C', 'R', 'F', 'โ–๊ธฐ๋ฐ˜', 'โ–', '๊ฐ', '์ฒด', '๋ช…', '์ธ', '์‹', '๊ธฐ๋ฅผ', 'โ–์‰ฝ๊ฒŒ', 'โ–๊ฐœ๋ฐœ', 'ํ• ', 'โ–์ˆ˜', 'โ–์žˆ์—ˆ๋‹ค', '.', '[SEP]']
len: 40, pred_ner_tag:['[CLS]', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'B-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'I-POH', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '[SEP]']
decoding_ner_sentence: [CLS] <SKTBrain:ORG>์—์„œ <KoBERT:POH> ๋ชจ๋ธ์„ ๊ณต๊ฐœํ•ด์ค€ ๋•๋ถ„์— <BERT-CRF:POH> ๊ธฐ๋ฐ˜ ๊ฐ์ฒด๋ช…์ธ์‹๊ธฐ๋ฅผ ์‰ฝ๊ฒŒ ๊ฐœ๋ฐœํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.[SEP]
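
The step from per-token BIO tags to the bracketed entities in the last line can be sketched as follows. This is an illustrative decoder written for this README, not the code of the NER project itself; `decode_entities` is a hypothetical name, and the input is a shortened version of the sample above.

```python
WORD_BOUNDARY = "\u2581"  # SentencePiece word-start marker


def decode_entities(tokens, tags):
    """Group B-/I- tagged pieces into (text, label) entity pairs."""
    entities = []
    pieces, label = [], None

    def flush():
        # Close the entity being built, if any.
        if pieces:
            text = "".join(pieces).replace(WORD_BOUNDARY, " ").strip()
            entities.append((text, label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            pieces, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            pieces.append(token)
        else:
            # "O", special tokens, or an I- tag that does not continue the entity.
            flush()
            pieces, label = [], None
    flush()
    return entities


tokens = ["[CLS]", "\u2581SK", "T", "B", "ra", "in", "에서",
          "\u2581K", "o", "B", "ER", "T", "\u2581모델", "을", "[SEP]"]
tags = ["[CLS]", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O",
        "B-POH", "I-POH", "I-POH", "I-POH", "I-POH", "O", "O", "[SEP]"]
print(decode_entities(tokens, tags))  # [('SKTBrain', 'ORG'), ('KoBERT', 'POH')]
```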

Korean Sentence BERT

| Model | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|---|---|---|---|---|---|---|---|---|
| NLI | 65.05 | 68.48 | 68.81 | 68.18 | 68.90 | 68.20 | 65.22 | 66.81 |
| STS | 80.42 | 79.64 | 77.93 | 77.43 | 77.92 | 77.44 | 76.56 | 75.83 |
| STS + NLI | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22 |
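
The column names suggest that each score is a correlation between a similarity measure over sentence-embedding pairs and gold similarity labels, apparently scaled by 100; the README does not state the evaluation procedure, so this reading is an assumption. Under that assumption, a "Cosine Pearson" score could be computed roughly as below. The embeddings and gold scores are made-up toy values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy sentence-embedding pairs with made-up gold similarity scores.
pairs = [([1.0, 0.0], [1.0, 0.0]),
         ([1.0, 0.0], [1.0, 1.0]),
         ([1.0, 0.0], [0.0, 1.0])]
gold = [5.0, 2.5, 0.0]

predicted = [cosine_similarity(u, v) for u, v in pairs]
score = 100 * pearson(predicted, gold)  # scaled by 100, like the table values appear to be
print(round(score, 2))
```

The Spearman columns would apply the same correlation to ranks instead of raw values, and the Euclidean, Manhattan, and Dot columns would swap in a different similarity measure.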

Release

  • v0.2.3
    • support onnx 1.8.0
  • v0.2.2
    • fix No module named 'kobert.utils'
  • v0.2.1
    • guide default 'import statements'
  • v0.2
    • download large files from aws s3
    • rename functions
  • v0.1.2
    • Guaranteed compatibility with higher versions of transformers
    • fix pad token index id
  • v0.1.1
    • Unified the vocabulary and tokenizer
  • v0.1
    • Initial model release

Contacts

Please file KoBERT-related issues here.

License

KoBERT is released under the Apache-2.0 license. Please comply with the license terms when using the model and code. The full license text is available in the LICENSE file.