呢個係個粵文篩選器,用嚟區分粵語同官話文本,對於篩選粵語語料好有用。個分類器會將輸入文本分成四類:
cantonese
: 純粵文,僅含有粵語特徵字詞,例如“你喺邊度”mandarin
: 純官話文,僅含有官話特徵字詞,例如“你在哪裏”mixed
:官粵混雜文,同時含有官話同粵語特徵嘅字詞,例如“是咁的”neutral
:無特徵漢語文,唔含有官話同粵語特徵,既可以當成粵文亦可以當成官話文,例如“去學校讀書”
分類方法係官話同粵語嘅特徵字詞識別。如果同時含有官話同粵語特徵詞彙就算官粵混雜,如果唔含有任何特徵,就算冇特徵中性文本。
本分類器為咗照顧性能,淨係用咗簡單嘅邏輯判斷,所以會犧牲一定嘅分類召回率(recall)。如果想要更高嘅分類準確率,請使用本工作組嘅 cantonesedetect
包。cantonesedetect
會計算各個類別嘅分佈比例再綜合判別,而且可以精細到 6 個類別輸出,準確度更高,但係速度亦會更慢。
本篩選器嘅主要設計目標係「篩選出可以用作訓練數據嘅優質粵文」,而非「準確分類輸入文本」。所以喺判斷粵語/官話嗰陣會用偏嚴格嘅判別標準,即係會犧牲 recall 嚟換取高 precision (寧願篩漏粵文句子都唔好將官話文誤判成粵文)。
本篩選器默認所有輸入文本都用推薦用字方案書寫。如果輸入文本採用其他用字方案(有錯別字),會影響分類篩選結果。例如輸入畀本書我
分類器會輸出cantonese
,但寫成比本書我
會輸出neutral
。你可以用錯別字修正器嚟清洗被分成neutral
嘅文本,噉樣可能會得到更多粵文。
呢隻篩選器默認所有輸入文本都係傳統漢字。如果要分類簡化字文本,要將佢哋轉化成傳統漢字先。推薦使用 OpenCC嚟轉換。
本工具以字詞特徵抽出「純粵文」文本嘅策略同埋實踐方式。呢個策略首先喺以下場合提出。討論本分類器時,請引用:
@inproceedings{lau-etal-2024-extraction,
title = "The Extraction and Fine-grained Classification of Written {C}antonese Materials through Linguistic Feature Detection",
author = "Lau, Chaak-ming and
Lau, Mingfei and
To, Ann Wai Huen",
editor = "Ojha, Atul Kr. and
Ahmadi, Sina and
Cinkov{\'a}, Silvie and
Fransen, Theodorus and
Liu, Chao-Hong and
McCrae, John P.",
booktitle = "Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.eurali-1.4",
pages = "24--29",
abstract = "This paper presents a linguistically-informed, non-machine-learning tool for classifying Written Cantonese, Standard Written Chinese, and the intermediate varieties used by Cantonese-speaking users from Hong Kong, which are often grouped into a single {``}Traditional Chinese{''} label. Our approach addresses the lack of textual materials for Cantonese NLP, a consequence of a lower sociolinguistic status of Written Cantonese and the interchangeable use of these varieties by users without sufficient language labeling. The tool utilizes key strings and quotation markers, which can be reduced to string operations, to effectively extract Written Cantonese sentences and documents from materials mixed with Standard Written Chinese. This allows for the flexible and efficient extraction of high-quality Cantonese data from large datasets, catering to specific classification needs. This implementation ensures that the tool can process large amounts of data at a low cost by bypassing model-inferencing, which is particularly significant for marginalized languages. The tool also aims to provide a baseline measure for future classification systems, and the approach may be applicable to other low-resource regional or diglossic languages.",
}
「粵文」同「官話文」嘅定義同界線取決於使用者嘅語言意識形態,呢度嘅分類方法以下文所描述嘅粵文書寫體作為基礎。討論本工具採取嘅分類準則,請引用:
@incollection{lau2024ideologically,
title={Ideologically driven divergence in Cantonese vernacular writing practices},
author={Lau, Chaak Ming},
booktitle={The Politics of Language in Hong Kong},
pages={19--42},
year={2024},
publisher={Routledge}
}
Python >= 3.11
首先用 pip 安裝
pip install canto-filter
你可以喺 Python 代碼入面用,亦都可以直接喺命令行入面用。
本篩選器剩得一個函數 judge()
,輸入一句話輸出佢嘅語言分類:
from cantofilter import judge
print(judge('你喺邊度')) # cantonese
print(judge('你在哪裏')) # mandarin
print(judge('是咁的')) # mixed
print(judge('去學校讀書')) # neutral
首先要有一個輸入文檔,例如input.txt
,入面每行一個句子.
然後運行下面命令
cantofilter --input input.txt > output.txt
噉樣會得到一個 output.txt
,入面有由 \t 分成嘅兩列,第一列係判斷標籤,第二列係句子原文本。
如果你想直接篩選出某一類嘅文本,噉可以加一個 --mode <LABEL>
參數喺後面,例如
cantofilter --input input.txt --mode cantonese > output.txt
噉樣輸出嘅 output.txt
就會係純粵文句子。如果想剩係要官話、官粵混合或者中性文本,將個 --mode
參數定成 mandarin
、mixed
、neutral
就得。
你亦都可以剩係輸出啲句子嘅分類結果,用 --mode label
就得:
cantofilter --input input.txt --mode label > output.txt
噉樣嘅 output.txt
剩得一列,全部都係分類標籤。
This is a text filter for Cantonese, designed for filtering Cantonese text corpus. It classifies input sentences with four output labels:
cantonese
: Pure Cantonese text, contains Cantonese-featured words. E.g. 你喺邊度mandarin
: Pure Mandarin text, contains Mandarin-feature words. E.g. 你在哪裏mixed
:Mixed Cantonese-Mandarin text, contains both Cantonese and Mandarin-featured words. E.g. 是咁的neutral
:No feature Chinese text, contains neither Cantonese nor Mandarin feature words. Such sentences can be used for both Cantonese and Mandarin text corpus. E.g. 去學校讀書
The filter is regex rule-based, by detecting Mandarin and Cantonese feature characters and words. If a sentence contains both Cantonese and Mandarin feature words, then it is a mixed-Cantonese-Mandarin sentence. If it contains neither features, it is a no-feature, neutral Chinese text.
This package uses simple if-else logic to classify input sentences for better performance, at the price of lower recalls. If you want more accurate classification sacrificing performance, please use our another package cantonesedetect
. cantonesedetect
uses calculates the proportions linguistic features and classify the input sentence based on the aggregated number. It can achieve more fine-grained classification with 6 output labels.
This filter is designed for the purpose of "obtaining high-quality Cantonese text", as opposed to "accurately classifying input texts". Therefore, it maximizes precision at the price of recall, to minimize the false positive rate / avoid including potential Mandarin sentences (we rather miss some Cantonese sentences, than mistaking potential Mandarin sentences as Cantonese).
This filter assumes all input text written in the recommended orthography. Spelling errors or typos in input text might affect the classification result. For instance, 畀本書我
yields cantonese
, while 比本書我
yields neutral
. You can use the spelling corrector to correct the neutral
text, which might give you more Cantonese text.
This filter assumes all input text in Traditional Chinese characters. If you want to filter texts written in simplified characters, please convert them into Traditional characters first. We recommend using OpenCC to do the conversion.
The implementation and methodology of this filter was first proposed in the following contexts. below. When discussing this filter, please cite:
@inproceedings{lau-etal-2024-extraction,
title = "The Extraction and Fine-grained Classification of Written {C}antonese Materials through Linguistic Feature Detection",
author = "Lau, Chaak-ming and
Lau, Mingfei and
To, Ann Wai Huen",
editor = "Ojha, Atul Kr. and
Ahmadi, Sina and
Cinkov{\'a}, Silvie and
Fransen, Theodorus and
Liu, Chao-Hong and
McCrae, John P.",
booktitle = "Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.eurali-1.4",
pages = "24--29",
abstract = "This paper presents a linguistically-informed, non-machine-learning tool for classifying Written Cantonese, Standard Written Chinese, and the intermediate varieties used by Cantonese-speaking users from Hong Kong, which are often grouped into a single {``}Traditional Chinese{''} label. Our approach addresses the lack of textual materials for Cantonese NLP, a consequence of a lower sociolinguistic status of Written Cantonese and the interchangeable use of these varieties by users without sufficient language labeling. The tool utilizes key strings and quotation markers, which can be reduced to string operations, to effectively extract Written Cantonese sentences and documents from materials mixed with Standard Written Chinese. This allows for the flexible and efficient extraction of high-quality Cantonese data from large datasets, catering to specific classification needs. This implementation ensures that the tool can process large amounts of data at a low cost by bypassing model-inferencing, which is particularly significant for marginalized languages. The tool also aims to provide a baseline measure for future classification systems, and the approach may be applicable to other low-resource regional or diglossic languages.",
}
The definitions and boundaries of 'Cantonese text' and 'Mandarin text' depend on the user's language ideology. The classification method used here is based on the Cantonese written style described in the following text. When discussing the criteria adopted by this tool, please cite:
@incollection{lau2024ideologically,
title={Ideologically driven divergence in Cantonese vernacular writing practices},
author={Lau, Chaak Ming},
booktitle={The Politics of Language in Hong Kong},
pages={19--42},
year={2024},
publisher={Routledge}
}
Python >= 3.11
Install the package with pip first
pip install canto-filter
This package can be used in python codes, or as a CLI tool.
There is only one function in this package, judge()
, which accepts a string input and outputs one of the labels:
from cantofilter import judge
print(judge('你喺邊度')) # cantonese
print(judge('你在哪裏')) # mandarin
print(judge('是咁的')) # mixed
print(judge('去學校讀書')) # neutral
Assume an input text file, e.g. input.txt
where each line is a sentence.
Then run
cantofilter --input input.txt > output.txt
There will be a output.txt
which has two columns. The first column is the language label, and the second column is the original input text.
If you want only one class of text, use the --mode <LABEL>
argument. Say if you want pure Cantonese text only:
cantofilter --input input.txt --mode cantonese > output.txt
The output.txt
will contain only Cantonese text.
If you want to include the classification labels in the output, use --mode label
like this:
cantofilter --input input.txt --mode label > output.txt
Then your output.txt
will contain only classification results of the input sentences.