feat(lang): ⚡ Rework of tokenizer. Additionally implemented new (easier) way of adding languages to the package
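The diffs below suggest the "easier way of adding languages" is a module-per-language convention: each language module exposes a module-level `tokenizer` callable, which a loader can resolve by language code. The following is a hedged sketch of that pattern, not the package's actual loader; the module names and `get_tokenizer` helper are hypothetical, and a fake module is registered in `sys.modules` purely so the sketch is self-contained.

```python
# Hedged sketch of the per-language-module pattern implied by the diffs below.
# Each language module exposes a module-level `tokenizer` callable; a loader
# resolves it by language code. Names here are hypothetical illustrations.
import importlib
import sys
import types

# Simulate one language module (in the real package this would be a file
# such as languages/<code>.py providing `tokenizer`).
fake = types.ModuleType("languages_demo.en")
fake.tokenizer = lambda text: text.split()
sys.modules["languages_demo.en"] = fake


def get_tokenizer(lang_code):
    # Dynamically import the language module and return its tokenizer.
    module = importlib.import_module(f"languages_demo.{lang_code}")
    return module.tokenizer


tok = get_tokenizer("en")
print(tok("hello world"))  # ['hello', 'world']
```

The appeal of this layout is that adding a language only requires dropping in one new module that defines `tokenizer`; no central registry needs editing.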
1 parent 1071667 · commit 0833859. Showing 36 changed files with 1,154 additions and 585 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Empty file.
@@ -0,0 +1,7 @@
def tokenizer(text):
    import nltk

    s = nltk.stem.isri.ISRIStemmer()
    words = nltk.tokenize.wordpunct_tokenize(text)
    words = [s.stem(word) for word in words]
    return words
@@ -0,0 +1,12 @@
try:
    import tinysegmenter
except ImportError as e:
    raise ImportError(
        "You must install tinysegmenter before using the Japanese tokenizer.\n"
        "Try pip install tinysegmenter\n"
        "or pip install newspaper3k[ja]\n"
        "or pip install newspaper3k[all]\n"
    ) from e

segmenter = tinysegmenter.TinySegmenter()
tokenizer = segmenter.tokenize
@@ -0,0 +1,14 @@
from nltk import word_tokenize

tokenizer = word_tokenize


def find_stopwords(tokens, stopwords):
    res = []
    for w in tokens:
        for s in stopwords:
            if w.endswith(s):
                res.append(w)
                break

    return res
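The suffix-matching helper above collects every token that ends with one of the stopword suffixes, with the `break` ensuring each token is appended at most once. A self-contained usage sketch; the token and suffix lists below are made-up examples, not data from the package:

```python
# Self-contained usage sketch of the suffix-matching stopword helper above.
# The token and suffix lists are made-up illustrations.
def find_stopwords(tokens, stopwords):
    res = []
    for w in tokens:
        for s in stopwords:
            if w.endswith(s):
                res.append(w)
                break  # stop at the first matching suffix for this token
    return res


tokens = ["running", "cat", "walked", "dog"]
suffixes = ["ing", "ed"]
print(find_stopwords(tokens, suffixes))  # ['running', 'walked']
```

Note that matching is purely on suffixes rather than whole words, which fits languages where stopword status is signaled by word endings.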
@@ -0,0 +1,11 @@
try:
    import pythainlp
except ImportError as e:
    raise ImportError(
        "You must install pythainlp before using the Thai tokenizer.\n"
        "Try pip install pythainlp\n"
        "or pip install newspaper3k[th]\n"
        "or pip install newspaper3k[all]\n"
    ) from e

tokenizer = pythainlp.word_tokenize
@@ -0,0 +1,11 @@
try:
    import jieba
except ImportError as e:
    raise ImportError(
        "You must install jieba before using the Chinese tokenizer.\n"
        "Try pip install jieba\n"
        "or pip install newspaper3k[zh]\n"
        "or pip install newspaper3k[all]\n"
    ) from e

tokenizer = lambda x: jieba.cut(x, cut_all=True)  # noqa: E731
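The Japanese, Thai, and Chinese modules all share the same optional-dependency pattern: attempt the import, and on failure re-raise `ImportError` with concrete install hints, chained via `from e`. A generic, self-contained sketch of that pattern; the `load_optional` helper is an illustration (not part of the package), and the module name used in the demo is deliberately nonexistent to exercise the failure path:

```python
# Generic sketch of the optional-dependency pattern used by these tokenizer
# modules: try the import, and re-raise ImportError with install hints.
# `load_optional` is a hypothetical helper, not part of the package.
import importlib


def load_optional(module_name, extra):
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Chain with `from e` so the original import failure stays visible.
        raise ImportError(
            f"You must install {module_name} before using this tokenizer.\n"
            f"Try pip install {module_name}\n"
            f"or pip install newspaper3k[{extra}]\n"
        ) from e


# Demonstrate the failure path with a deliberately nonexistent module name.
try:
    load_optional("definitely_not_a_real_module_xyz", "xx")
except ImportError as err:
    print("raised with hint:", "pip install" in str(err))
```

Deferring the hard dependency to import time of the language module, rather than of the whole package, means users only pay for the language backends they actually use.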