This project provides Traditional Chinese transformers models (including ALBERT, BERT, and GPT2) and NLP tools (including word segmentation, part-of-speech tagging, and named entity recognition).
這個專案提供了繁體中文的 transformers 模型(包含 ALBERT、BERT、GPT2)及自然語言處理工具(包含斷詞、詞性標記、實體辨識)。
- Mu Yang at CKIP (Author & Maintainer).
- Wei-Yun Ma at CKIP (Maintainer).
- CkipTagger: An alternative Chinese NLP library that uses BiLSTM.
- CKIP CoreNLP Toolkit: A Chinese NLP library with more NLP tasks and utilities.
You may also use our pretrained models directly with the HuggingFace transformers library: https://huggingface.co/ckiplab/.
您可於 https://huggingface.co/ckiplab/ 下載預訓練的模型。
- Language Models
- ALBERT Tiny:
ckiplab/albert-tiny-chinese
- ALBERT Base:
ckiplab/albert-base-chinese
- BERT Tiny:
ckiplab/bert-tiny-chinese
- BERT Base:
ckiplab/bert-base-chinese
- GPT2 Tiny:
ckiplab/gpt2-tiny-chinese
- GPT2 Base:
ckiplab/gpt2-base-chinese
- NLP Task Models
- ALBERT Tiny — Word Segmentation:
ckiplab/albert-tiny-chinese-ws
- ALBERT Tiny — Part-of-Speech Tagging:
ckiplab/albert-tiny-chinese-pos
- ALBERT Tiny — Named-Entity Recognition:
ckiplab/albert-tiny-chinese-ner
- ALBERT Base — Word Segmentation:
ckiplab/albert-base-chinese-ws
- ALBERT Base — Part-of-Speech Tagging:
ckiplab/albert-base-chinese-pos
- ALBERT Base — Named-Entity Recognition:
ckiplab/albert-base-chinese-ner
- BERT Tiny — Word Segmentation:
ckiplab/bert-tiny-chinese-ws
- BERT Tiny — Part-of-Speech Tagging:
ckiplab/bert-tiny-chinese-pos
- BERT Tiny — Named-Entity Recognition:
ckiplab/bert-tiny-chinese-ner
- BERT Base — Word Segmentation:
ckiplab/bert-base-chinese-ws
- BERT Base — Part-of-Speech Tagging:
ckiplab/bert-base-chinese-pos
- BERT Base — Named-Entity Recognition:
ckiplab/bert-base-chinese-ner
You may use our models directly through HuggingFace's transformers library.
您可直接透過 HuggingFace's transformers 套件使用我們的模型。
pip install -U transformers
Please use BertTokenizerFast as the tokenizer, and replace ckiplab/albert-tiny-chinese and ckiplab/albert-tiny-chinese-ws with any model you need in the following example.
請使用內建的 BertTokenizerFast,並將以下範例中的 ckiplab/albert-tiny-chinese 與 ckiplab/albert-tiny-chinese-ws 替換成任何您要使用的模型名稱。
from transformers import (
    BertTokenizerFast,
    AutoModelForMaskedLM,
    AutoModelForCausalLM,
    AutoModelForTokenClassification,
)
# masked language model (ALBERT, BERT)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese') # or other models above
# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above
# nlp task model
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForTokenClassification.from_pretrained('ckiplab/albert-tiny-chinese-ws') # or other models above
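The three cases above differ only in which Auto class is paired with the model name. A minimal sketch of that choice follows, assuming the naming pattern in the model list holds (a `-ws`, `-pos`, or `-ner` suffix for task models, `gpt2` in the name for causal models). The helper names `pick_auto_class_name` and `load_model_and_tokenizer` are hypothetical and not part of transformers or this project.

```python
def pick_auto_class_name(model_name: str) -> str:
    """Return the name of the transformers Auto class that fits a ckiplab model name.

    Assumption: the suffix and name conventions from the model list above.
    """
    if model_name.endswith(("-ws", "-pos", "-ner")):
        return "AutoModelForTokenClassification"
    if "gpt2" in model_name:
        return "AutoModelForCausalLM"
    return "AutoModelForMaskedLM"


def load_model_and_tokenizer(model_name: str):
    """Load a model together with the bert-base-chinese tokenizer (downloads weights)."""
    import transformers  # imported lazily so the helper above works without it

    tokenizer = transformers.BertTokenizerFast.from_pretrained("bert-base-chinese")
    model_class = getattr(transformers, pick_auto_class_name(model_name))
    model = model_class.from_pretrained(model_name)
    return tokenizer, model
```

Calling `load_model_and_tokenizer("ckiplab/albert-tiny-chinese-ws")` would then be equivalent to the third case above.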
To fine-tune our models on your own datasets, please refer to the following examples from HuggingFace's transformers.
您可參考以下的範例去微調我們的模型於您自己的資料集。
- https://github.com/huggingface/transformers/tree/master/examples
- https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling
- https://github.com/huggingface/transformers/tree/master/examples/pytorch/token-classification
Remember to set --tokenizer_name bert-base-chinese in order to use the Chinese tokenizer.
記得設置 --tokenizer_name bert-base-chinese 以正確的使用中文的 tokenizer。
# Masked language modeling; ckiplab/albert-tiny-chinese may be replaced by any language model above
python run_mlm.py \
    --model_name_or_path ckiplab/albert-tiny-chinese \
    --tokenizer_name bert-base-chinese \
    ...
# Token classification; ckiplab/albert-tiny-chinese-ws may be replaced by any task model above
python run_ner.py \
    --model_name_or_path ckiplab/albert-tiny-chinese-ws \
    --tokenizer_name bert-base-chinese \
    ...
The following is a performance comparison between our models and other models.
The results are tested on a traditional Chinese corpus.
以下是我們的模型與其他的模型之性能比較。
各個任務皆測試於繁體中文的測試集。
Model | #Parameters | Perplexity† | WS (F1)‡ | POS (ACC)‡ | NER (F1)‡ |
---|---|---|---|---|---|
ckiplab/albert-tiny-chinese | 4M | 4.80 | 96.66% | 94.48% | 71.17% |
ckiplab/albert-base-chinese | 11M | 2.65 | 97.33% | 95.30% | 79.47% |
ckiplab/bert-tiny-chinese | 12M | 8.07 | 96.98% | 95.11% | 74.21% |
ckiplab/bert-base-chinese | 102M | 1.88 | 97.60% | 95.67% | 81.18% |
ckiplab/gpt2-tiny-chinese | 4M | 16.94 | -- | -- | -- |
ckiplab/gpt2-base-chinese | 102M | 8.36 | -- | -- | -- |
voidful/albert_chinese_tiny | 4M | 74.93 | -- | -- | -- |
voidful/albert_chinese_base | 11M | 22.34 | -- | -- | -- |
bert-base-chinese | 102M | 2.53 | -- | -- | -- |
† Perplexity; the smaller the better.
† 混淆度;數字越小越好。
‡ WS: word segmentation; POS: part-of-speech; NER: named-entity recognition; the larger the better.
‡ WS: 斷詞;POS: 詞性標記;NER: 實體辨識;數字越大越好。
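For readers unfamiliar with the perplexity column, the sketch below shows the usual definition: the exponential of the average negative log-likelihood per token. How the numbers in the table were computed (for example, which tokens were scored for the masked models) is not stated here, so this is an illustration of the metric only. The function name `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean(log p(token))), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is close to 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

A lower value means the model assigns higher probability to the observed tokens, which is why smaller is better.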
The language models are trained on the ZhWiki and CNA datasets; the WS and POS tasks are trained on the ASBC dataset; the NER tasks are trained on the OntoNotes dataset.
以上的語言模型訓練於 ZhWiki 與 CNA 資料集上;斷詞(WS)與詞性標記(POS)任務模型訓練於 ASBC 資料集上;實體辨識(NER)任務模型訓練於 OntoNotes 資料集上。
- CNA: https://catalog.ldc.upenn.edu/LDC2011T13
- Chinese Gigaword Fifth Edition, CNA (Central News Agency) part.
- 中文 Gigaword 第五版,CNA(中央社)的部分。
- ASBC: http://asbc.iis.sinica.edu.tw
- Academia Sinica Balanced Corpus of Modern Chinese release 4.0.
- 中央研究院漢語平衡語料庫第四版。
- OntoNotes: https://catalog.ldc.upenn.edu/LDC2013T19
Here is a summary of each corpus.
以下是各個資料集的一覽表。
Dataset | #Documents | #Lines | #Characters | Line Type |
---|---|---|---|---|
CNA | 2,559,520 | 13,532,445 | 1,219,029,974 | Paragraph |
ZhWiki | 1,106,783 | 5,918,975 | 495,446,829 | Paragraph |
ASBC | 19,247 | 1,395,949 | 17,572,374 | Clause |
OntoNotes | 1,911 | 48,067 | 1,568,491 | Sentence |
Here is the dataset split used for language models.
以下是用於訓練語言模型的資料集切割。
CNA+ZhWiki | #Documents | #Lines | #Characters |
---|---|---|---|
Train | 3,606,303 | 18,986,238 | 4,347,517,682 |
Dev | 30,000 | 148,077 | 32,888,978 |
Test | 30,000 | 151,241 | 35,216,818 |
Here is the dataset split used for word segmentation and part-of-speech tagging models.
以下是用於訓練斷詞及詞性標記模型的資料集切割。
ASBC | #Documents | #Lines | #Words | #Characters |
---|---|---|---|---|
Train | 15,247 | 1,183,260 | 9,480,899 | 14,724,250 |
Dev | 2,000 | 52,677 | 448,964 | 741,323 |
Test | 2,000 | 160,012 | 1,315,129 | 2,106,799 |
Here is the dataset split used for named entity recognition models.
以下是用於訓練實體辨識模型的資料集切割。
OntoNotes | #Documents | #Lines | #Characters | #Named-Entities |
---|---|---|---|---|
Train | 1,511 | 43,362 | 1,367,658 | 68,947 |
Dev | 200 | 2,304 | 93,535 | 7,186 |
Test | 200 | 2,401 | 107,298 | 6,977 |
The package also provides the following NLP tools.
我們的套件也提供了以下的自然語言處理工具。
- (WS) Word Segmentation 斷詞
- (POS) Part-of-Speech Tagging 詞性標記
- (NER) Named Entity Recognition 實體辨識
pip install -U ckip-transformers
Requirements:
- Python 3.6+
- PyTorch 1.5+
- HuggingFace Transformers 3.5+
The complete script of this example is available at https://github.com/ckiplab/ckip-transformers/blob/master/example/example.py.
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker
We provide several pretrained models for the NLP tools.
我們提供了一些適用於自然語言工具的預訓練的模型。
# Initialize drivers
ws_driver = CkipWordSegmenter(model="bert-base")
pos_driver = CkipPosTagger(model="bert-base")
ner_driver = CkipNerChunker(model="bert-base")
One may also load their own checkpoints using our drivers.
也可以運用我們的工具於自己訓練的模型上。
# Initialize drivers with custom checkpoints
ws_driver = CkipWordSegmenter(model_name="path_to_your_model")
pos_driver = CkipPosTagger(model_name="path_to_your_model")
ner_driver = CkipNerChunker(model_name="path_to_your_model")
To use a GPU, specify the device ID when initializing the drivers. Set it to -1 (default) to disable GPU.
可於宣告斷詞等工具時指定 device 以使用 GPU,設為 -1 (預設值)代表不使用 GPU。
# Use CPU
ws_driver = CkipWordSegmenter(device=-1)
# Use GPU:0
ws_driver = CkipWordSegmenter(device=0)
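If the same script has to run on machines with and without a GPU, the device ID can be chosen at runtime. The sketch below is one way to do that with PyTorch's `torch.cuda.is_available()`. The helper name `pick_device` is hypothetical, and always picking GPU 0 is an assumption made for illustration.

```python
def pick_device() -> int:
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to query.
        return -1
    return 0 if torch.cuda.is_available() else -1


# Usage with a driver (not executed here):
# ws_driver = CkipWordSegmenter(device=pick_device())
print(pick_device())
```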
The input for word segmentation and named-entity recognition must be a list of sentences.
The input for part-of-speech tagging must be a list of lists of words (the output of word segmentation).
斷詞與實體辨識的輸入必須是 list of sentences。
詞性標記的輸入必須是 list of list of words。
# Input text
text = [
"傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。",
"美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。",
"空白 也是可以的~",
]
# Run pipeline
ws = ws_driver(text)
pos = pos_driver(ws)
ner = ner_driver(text)
The POS driver automatically segments each sentence internally at the characters ',,。::;;!!??' while running the model (the output sentences are concatenated back). You may set delim_set to any characters you want. You may set use_delim=False to disable this feature, or set use_delim=True in the WS and NER drivers to enable it.
詞性標記工具會自動用 ',,。::;;!!??' 等字元在執行模型前切割句子(輸出的句子會自動接回)。可設定 delim_set 參數使用別的字元做切割。另外可指定 use_delim=False 以停用此功能,或於斷詞、實體辨識時指定 use_delim=True 以啟用此功能。
# Enable sentence segmentation
ws = ws_driver(text, use_delim=True)
ner = ner_driver(text, use_delim=True)
# Disable sentence segmentation
pos = pos_driver(ws, use_delim=False)
# Use new line characters and tabs for sentence segmentation
pos = pos_driver(ws, delim_set='\n\t')
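To make the delimiter-based splitting concrete, here is a rough sketch of the idea on a plain string: cut after every delimiter character, process the pieces, then join them back. This is not the library's implementation. In particular, keeping each delimiter attached to the piece before it is an assumption, and `split_by_delims` is a hypothetical helper.

```python
DEFAULT_DELIMS = ",,。::;;!!??"


def split_by_delims(sentence, delim_set=DEFAULT_DELIMS):
    """Split a sentence after each delimiter character, keeping the delimiter."""
    pieces, start = [], 0
    for i, ch in enumerate(sentence):
        if ch in delim_set:
            pieces.append(sentence[start:i + 1])
            start = i + 1
    if start < len(sentence):
        # Trailing text without a closing delimiter forms the last piece.
        pieces.append(sentence[start:])
    return pieces


pieces = split_by_delims("他不懂,自己哪裡得罪到電視台。")
print(pieces)
# Joining the pieces restores the original sentence.
print("".join(pieces))
```

Shorter pieces keep each model input small, which is presumably why the feature is useful for long sentences.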
You may specify batch_size and max_length to better utilize your machine resources.
您亦可設置 batch_size 與 max_length 以更完美的利用您的機器資源。
# Sets the batch size and maximum sentence length
ws = ws_driver(text, batch_size=256, max_length=128)
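As a rough mental model of these two parameters, the sketch below cuts long inputs into windows of at most `max_length` characters and groups the windows into batches of `batch_size`. How the drivers actually handle inputs longer than `max_length` is not described here, so the windowing is an assumption, and both helper names are hypothetical.

```python
def chunk_to_max_length(sentence, max_length):
    """Cut a sentence into consecutive windows of at most max_length characters."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [sentence[i:i + max_length] for i in range(0, len(sentence), max_length)]


def make_batches(items, batch_size):
    """Group items into consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


windows = chunk_to_max_length("abcdefg", max_length=3)
print(windows)
print(make_batches(windows, batch_size=2))
```

Larger batches generally trade memory for throughput, and a smaller `max_length` reduces the memory needed per input.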
# Pack word segmentation and part-of-speech results
def pack_ws_pos_sentence(sentence_ws, sentence_pos):
    assert len(sentence_ws) == len(sentence_pos)
    res = []
    for word_ws, word_pos in zip(sentence_ws, sentence_pos):
        res.append(f"{word_ws}({word_pos})")
    return "\u3000".join(res)
# Show results
for sentence, sentence_ws, sentence_pos, sentence_ner in zip(text, ws, pos, ner):
    print(sentence)
    print(pack_ws_pos_sentence(sentence_ws, sentence_pos))
    for entity in sentence_ner:
        print(entity)
    print()
傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
傅達仁(Nb) 今(Nd) 將(D) 執行(VC) 安樂死(Na) ,(COMMACATEGORY) 卻(D) 突然(D) 爆出(VJ) 自己(Nh) 20(Neu) 年(Nd) 前(Ng) 遭(P) 緯來(Nb) 體育台(Na) 封殺(VC) ,(COMMACATEGORY) 他(Nh) 不(D) 懂(VK) 自己(Nh) 哪裡(Ncd) 得罪到(VC) 電視台(Nc) 。(PERIODCATEGORY)
NerToken(word='傅達仁', ner='PERSON', idx=(0, 3))
NerToken(word='20年', ner='DATE', idx=(18, 21))
NerToken(word='緯來體育台', ner='ORG', idx=(23, 28))
美國參議院針對今天總統布什所提名的勞工部長趙小蘭展開認可聽證會,預料她將會很順利通過參議院支持,成為該國有史以來第一位的華裔女性內閣成員。
美國(Nc) 參議院(Nc) 針對(P) 今天(Nd) 總統(Na) 布什(Nb) 所(D) 提名(VC) 的(DE) 勞工部長(Na) 趙小蘭(Nb) 展開(VC) 認可(VC) 聽證會(Na) ,(COMMACATEGORY) 預料(VE) 她(Nh) 將(D) 會(D) 很(Dfa) 順利(VH) 通過(VC) 參議院(Nc) 支持(VC) ,(COMMACATEGORY) 成為(VG) 該(Nes) 國(Nc) 有史以來(D) 第一(Neu) 位(Nf) 的(DE) 華裔(Na) 女性(Na) 內閣(Na) 成員(Na) 。(PERIODCATEGORY)
NerToken(word='美國參議院', ner='ORG', idx=(0, 5))
NerToken(word='今天', ner='LOC', idx=(7, 9))
NerToken(word='布什', ner='PERSON', idx=(11, 13))
NerToken(word='勞工部長', ner='ORG', idx=(17, 21))
NerToken(word='趙小蘭', ner='PERSON', idx=(21, 24))
NerToken(word='認可聽證會', ner='EVENT', idx=(26, 31))
NerToken(word='參議院', ner='ORG', idx=(42, 45))
NerToken(word='第一', ner='ORDINAL', idx=(56, 58))
NerToken(word='華裔', ner='NORP', idx=(60, 62))
空白 也是可以的~
空白(VH) (WHITESPACE) 也(D) 是(SHI) 可以(VH) 的(T) ~(FW)
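In the output above, the `idx` field of each entity reads as a pair of character offsets (start, end) into the input sentence, with the end exclusive. The sketch below checks that reading against the first sentence and groups the entities by type. `NerToken` here is a local stand-in with the same three fields as the printed output, not the class imported from the package, and `group_by_type` is a hypothetical helper.

```python
from typing import NamedTuple, Tuple


class NerToken(NamedTuple):
    """Local stand-in mirroring the fields printed in the output above."""
    word: str
    ner: str
    idx: Tuple[int, int]


sentence = "傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。"
entities = [
    NerToken(word="傅達仁", ner="PERSON", idx=(0, 3)),
    NerToken(word="20年", ner="DATE", idx=(18, 21)),
    NerToken(word="緯來體育台", ner="ORG", idx=(23, 28)),
]


def group_by_type(sentence, entities):
    """Group entity strings by type, slicing them out of the sentence via idx."""
    grouped = {}
    for entity in entities:
        start, end = entity.idx
        grouped.setdefault(entity.ner, []).append(sentence[start:end])
    return grouped


print(group_by_type(sentence, entities))
```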
The following is a performance comparison between our tool and other tools.
以下是我們的工具與其他的工具之性能比較。
Tool | WS (F1) | POS (Acc) | WS+POS (F1) | NER (F1) |
---|---|---|---|---|
CKIP BERT Base | 97.60% | 95.67% | 94.19% | 81.18% |
CKIP ALBERT Base | 97.33% | 95.30% | 93.52% | 79.47% |
CKIP BERT Tiny | 96.98% | 95.08% | 93.13% | 74.20% |
CKIP ALBERT Tiny | 96.66% | 94.48% | 92.25% | 71.17% |
Monpa† | 92.58% | -- | 83.88% | -- |
Jieba | 81.18% | -- | -- | -- |
† Monpa provides only 3 types of tags in NER.
† Monpa 的實體辨識僅提供三種標記而已。
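The scoring procedure behind the WS (F1) and WS+POS (F1) columns is not spelled out here. A common convention, assumed in the sketch below, is span-based: a predicted word counts as correct only if its character span matches a gold word exactly, and for WS+POS the tag must match as well. The helper names `to_spans` and `span_f1` are hypothetical, and the sample sentence and tags are for illustration only.

```python
def to_spans(words, tags=None):
    """Convert a word list into a set of (start, end, tag) character spans."""
    spans, start = set(), 0
    for i, word in enumerate(words):
        end = start + len(word)
        spans.add((start, end, tags[i] if tags is not None else None))
        start = end
    return spans


def span_f1(gold, pred):
    """Harmonic mean of precision and recall over exactly matching spans."""
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold_words, gold_tags = ["今天", "天氣", "很", "好"], ["Nd", "Na", "Dfa", "VH"]
pred_words, pred_tags = ["今天", "天", "氣", "很", "好"], ["Nd", "Na", "Na", "Dfa", "VA"]

# Segmentation only: 3 of 5 predicted words match 3 of 4 gold words.
ws_f1 = span_f1(to_spans(gold_words), to_spans(pred_words))
# Segmentation plus tag: only the spans whose tags also agree count.
ws_pos_f1 = span_f1(to_spans(gold_words, gold_tags), to_spans(pred_words, pred_tags))
print(ws_f1, ws_pos_f1)
```

Under this convention the joint WS+POS score can never exceed the WS score, which is consistent with the tables.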
The following results are tested on a different dataset.†
以下實驗在另一個資料集測試。†
Tool | WS (F1) | POS (Acc) | WS+POS (F1) | NER (F1) |
---|---|---|---|---|
CKIP BERT Base | 97.84% | 96.46% | 94.91% | 79.20% |
CkipTagger | 97.33% | 97.20% | 94.75% | 77.87% |
† Here we retrained and tested our BERT model on the same dataset as CkipTagger.
† 我們重新訓練/測試我們的 BERT 模型於跟 CkipTagger 相同的資料集。
Copyright (c) 2023 CKIP Lab under the GPL-3.0 License.