Release 1.0.3 (#8)
* Update the word list and README

* Release 1.0.3
laubonghaudoi authored Dec 22, 2023
1 parent b867f80 commit 9378541
Showing 3 changed files with 23 additions and 10 deletions.
16 changes: 14 additions & 2 deletions README.md
@@ -15,9 +15,13 @@

分類方法係官話同粵語嘅特徵字詞識別。如果同時含有官話同粵語特徵詞彙就算官粵混雜,如果唔含有任何特徵,就算冇特徵中性文本。

### 設計思想同假設

本篩選器嘅主要設計目標係「篩選出可以用作訓練數據嘅優質粵文」,而非「準確分類輸入文本」。所以喺判斷粵語/官話嗰陣會用偏嚴格嘅判別標準,即係會犧牲 recall 嚟換取高 precision (寧願篩漏粵文句子都唔好將官話文誤判成粵文)。

注意:呢隻分類器**默認所有輸入文本都係傳統漢字**。如果要分類簡化字文本,要將佢哋轉化成傳統漢字先。推薦使用 [OpenCC](https://github.com/BYVoid/OpenCC)嚟轉換。
本篩選器**默認所有輸入文本都用[推薦用字方案](https://jyutping.org/blog/typo/)書寫**。如果輸入文本採用其他用字方案(有錯別字),會影響分類篩選結果。例如輸入`畀本書我`分類器會輸出`cantonese`,但寫成`比本書我`會輸出`neutral`。你可以用[錯別字修正器](https://github.com/CanCLID/typo-corrector)嚟清洗被分成`neutral`嘅文本,噉樣可能會得到更多粵文。

呢隻篩選器**默認所有輸入文本都係傳統漢字**。如果要分類簡化字文本,要將佢哋轉化成傳統漢字先。推薦使用 [OpenCC](https://github.com/BYVoid/OpenCC)嚟轉換。

### 引用本篩選器

@@ -92,6 +96,8 @@ Python >= 3.6

# Cantonese text filter

## Overview

This is a text filter for Cantonese, designed for filtering Cantonese text corpora. It classifies each input sentence into one of four output labels:

1. `cantonese`: Pure Cantonese text, contains Cantonese-featured words. E.g. 你喺邊度
@@ -101,7 +107,13 @@ This is a text filter for Cantonese, designed for filtering Cantonese text corpu

The filter is rule-based: it uses regular expressions to detect Mandarin and Cantonese feature characters and words. If a sentence contains both Cantonese and Mandarin feature words, it is a mixed Cantonese-Mandarin sentence. If it contains neither, it is featureless, neutral Chinese text.
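As a rough illustration of the rule described above, the sketch below classifies a sentence with two small regular expressions. The patterns are tiny subsets picked from the lists in `cantofilter/judge.py`, the function name `classify` is made up for this example, and the `mandarin` label string is an assumption (only `cantonese`, `mixed` and `neutral` appear verbatim in this commit). It also leaves out the loan-word handling that the real `judge` function performs.

```python
import re

# Small illustrative subsets of the feature patterns in judge.py, not the full lists.
CANTO = re.compile(r'[嘅嗰啲咗佢喺冇哋畀嚟]|唔[係得會]|邊[度個]')
MANDO = re.compile(r'[這哪您們啥她]|還[是好有]')


def classify(sentence: str) -> str:
    """Label a sentence by which kinds of feature words it contains."""
    has_canto = CANTO.search(sentence) is not None
    has_mando = MANDO.search(sentence) is not None
    if has_canto and has_mando:
        return "mixed"
    if has_canto:
        return "cantonese"
    if has_mando:
        return "mandarin"  # assumed label name, not shown in this commit
    return "neutral"


print(classify("你喺邊度"))  # cantonese
print(classify("比本書我"))  # neutral
```

Because a sentence needs at least one feature match to leave the `neutral` bucket, a rule set like this tends to under-label rather than over-label.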

Note: This filter **assumes all input text is in Traditional Chinese characters**. If you want to filter texts written in simplified characters, please convert them into Traditional characters first. We recommend using [OpenCC](https://github.com/BYVoid/OpenCC) to do the conversion.
### Design principles and assumptions

This filter is designed to obtain high-quality Cantonese text that can be used as training data, rather than to classify input text accurately. It therefore applies strict criteria that trade recall for precision: we would rather miss some Cantonese sentences than mistake Mandarin sentences for Cantonese.
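To make the trade-off concrete, here is a small worked example of precision and recall for the `cantonese` label. The counts are invented purely for illustration and are not measurements of this filter; `precision_recall` is a hypothetical helper, not part of the package.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented counts: a strict rule set accepts fewer sentences but almost no Mandarin,
# while a lenient one accepts more Cantonese at the cost of more Mandarin slipping in.
strict = precision_recall(tp=80, fp=1, fn=20)
lenient = precision_recall(tp=95, fp=15, fn=5)

print("strict  precision=%.3f recall=%.3f" % strict)
print("lenient precision=%.3f recall=%.3f" % lenient)
```

For a training corpus, each false positive is a Mandarin sentence that ends up in the Cantonese data, which is why the strict setting is preferred here.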

This filter **assumes all input text is written in [the recommended orthography](https://jyutping.org/blog/typo/)**. Text written in another orthography (that is, with nonstandard characters or typos) may be classified differently. For instance, `畀本書我` yields `cantonese`, while `比本書我` yields `neutral`. You can run text labelled `neutral` through the [typo corrector](https://github.com/CanCLID/typo-corrector) and filter it again, which may recover more Cantonese text.

This filter **assumes all input text is in Traditional Chinese characters**. To filter text written in Simplified characters, convert it to Traditional characters first. We recommend [OpenCC](https://github.com/BYVoid/OpenCC) for the conversion.

### Citing this package

15 changes: 8 additions & 7 deletions cantofilter/judge.py
@@ -3,23 +3,24 @@

canto_unique = re.compile(
r'[嘅嗰啲咗佢喺咁噉冇啩哋畀嚟諗惗乜嘢閪撚𨳍𨳊瞓睇㗎餸𨋢摷喎嚿噃嚡嘥嗮啱揾搵喐逳噏𢳂岋糴揈捹撳㩒𥄫攰癐冚孻冧𡃁嚫跣𨃩瀡氹嬲掟孭黐唞㪗埞忟𢛴]|' +
r'唔[係得會好識使洗駛通知到去走掂該]|點[樣會做得解]|[琴尋噚聽第]日|[而依]家|家[下陣]|[真就]係|邊[度個位科]|' +
r'[嚇凍攝整揩逢淥浸激][親嚫]|[橫搞傾諗得唔]掂|仲[有係話要得好衰唔]|返[學工去歸]|' +
r'屋企|收皮|慳錢|傾[偈計]|幫襯|執[好生實返輸]|求其|是[但旦]|[濕溼]碎|零舍|肉[赤緊]')
r'唔[係得會好識使洗駛通知到去走掂該錯差]|點[樣會做得解]|[琴尋噚聽第]日|[而依]家|家[下陣]|[真就實梗又話都]係|邊[度個位科]|' +
r'[嚇凍攝整揩逢淥浸激][親嚫]|[橫搞傾諗得唔]掂|仲[有係話要得好衰唔]|返[學工去歸]|執[好生實返輸]|' +
r'屋企|收皮|慳錢|傾[偈計]|幫襯|求其|是[但旦]|[濕溼]碎|零舍|肉[赤緊酸]|核突|同埋|勁[秋抽]')
mando_unique = re.compile(r'[這哪您們唄咱啥甭她]|還[是好有]')
# “在不” 因為太多融入粵語所以唔喺判別標準內
# 在 and 不 are too well integrated into Cantonese, so they are not used as criteria
mando_feature = re.compile(r'[那是的他吧沒麼么些了卻説說吃弄]|而已')
mando_feature = re.compile(r'[那是的他它吧沒麼么些了卻説說吃弄]|而已')
mando_loan = re.compile(r'亞利桑那|剎那|巴塞羅那|薩那|沙那|哈瓦那|印第安那|那不勒斯|支那|' +
r'是[否日次非但旦]|[利於]是|唯命是從|頭頭是道|似是而非|自以為是|俯拾皆是|撩是鬥非|莫衷一是|唯才是用|' +
r'[目綠藍紅中]的|的[士確式]|波羅的海|眾矢之的|的而且確|大眼的度|' +
r'些[微少許小]|' +
r'[淹沉浸覆湮埋沒出]沒|沒[落頂收]|神出鬼沒|' +
r'了[結無斷當然哥結得解事之]|[未明]了|不得了|大不了|' +
r'他[信人國日殺鄉]|[其利無排維結]他|馬耳他|他加祿|他山之石|' +
r'其[它]|' +
r'[酒網水貼]吧|吧[台臺枱檯]|' +
r'[退忘阻]卻|卻步|' +
r'[遊游小傳解學假淺眾衆訴論][説說]|[說説][話服明]|自圓其[説說]|長話短[說説]|不由分[說説]|' +
r'吃[虧苦力]|' +
r'吃[虧苦力]|' +
r'弄[堂]')
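The role of `mando_loan` can be sketched as follows. The body of `is_all_loan` is not shown in this diff, so this is only one plausible approach under that assumption: mask every loan-word match, then check whether any Mandarin feature character is left. The function name `all_features_are_loans` and the shortened patterns are made up for the example.

```python
import re

# Shortened stand-ins for mando_feature and mando_loan above.
feature = re.compile(r'[那是的他了]')
loan = re.compile(r'是[否日次非但旦]|[目綠藍紅中]的|的[士確式]|不得了|大不了|[其利無排維結]他')


def all_features_are_loans(s: str) -> bool:
    """True if every Mandarin feature character in s sits inside a known loan word."""
    # Replace loan words with a space so the remaining characters cannot
    # join up into a new, spurious match.
    masked = loan.sub(' ', s)
    return feature.search(masked) is None


print(all_features_are_loans("搭的士"))  # True: 的 only occurs inside 的士
print(all_features_are_loans("我的書"))  # False: 的 is a plain Mandarin particle
```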


@@ -63,7 +64,7 @@ def judge(s: str) -> str:
'''
判斷一句話係粵語、官話、官話溝粵語定係中性
Judge whether a sentence is Cantonese, Mandarin, mixed-Mandarin-Cantonese, or neutral.
Args:
s (str): 一句話 A sentence
Returns:
@@ -86,7 +87,7 @@ def judge(s: str) -> str:
return "mixed"
else:
# 含有官話成分,冇官話專屬詞,有可能官話借詞,亦都算粵語
# Contain Mandarin features, no Mandarin unique words,
# Contain Mandarin features, no Mandarin unique words,
# which may be Mandarin loan words that also count as Cantonese
if is_all_loan(s):
# 所有官話特色都係借詞,所以仲係算粵語
2 changes: 1 addition & 1 deletion cantofilter/version.py
@@ -1 +1 @@
__version__ = "1.0.2"
__version__ = "1.0.3"
