Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat:integrate nltk data into docker image
1 parent 9d1a57d, commit 27eea08
Showing 77 changed files with 778,808 additions and 2 deletions.
1 change: 1 addition & 0 deletions
nltk_data/taggers/averaged_perceptron_tagger_eng/averaged_perceptron_tagger_eng.classes.json
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
[".", "(", ")", ":", "''", "EX", "JJS", "WRB", "VBG", "VBP", "NN", "SYM", "VB", "UH", "NNPS", "NNP", "``", "$", "NNS", "JJR", "MD", "RP", "VBD", "DT", "POS", "RBR", ",", "VBZ", "PDT", "VBN", "WP$", "WDT", "WP", "PRP$", "CD", "IN", "#", "CC", "RB", "FW", "RBS", "PRP", "LS", "JJ", "TO"] |
1 change: 1 addition & 0 deletions
nltk_data/taggers/averaged_perceptron_tagger_eng/averaged_perceptron_tagger_eng.tagdict.json
Large diffs are not rendered by default.
1 change: 1 addition & 0 deletions
nltk_data/taggers/averaged_perceptron_tagger_eng/averaged_perceptron_tagger_eng.weights.json
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected) | ||
|
||
Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have | ||
been contributed by various people using NLTK for sentence boundary detection. | ||
|
||
For information about how to use these models, please confer the tokenization HOWTO: | ||
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html | ||
and chapter 3.8 of the NLTK book: | ||
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation | ||
|
||
There are pretrained tokenizers for the following languages: | ||
|
||
File                Language      Source                              Contents                      Size of training corpus(in tokens)   Model contributed by | ||
================================================================================================================================================================ | ||
czech.pickle        Czech         Multilingual Corpus 1 (ECI)         Lidove Noviny                 ~345,000                             Jan Strunk / Tibor Kiss | ||
                                                                      Literarni Noviny | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
danish.pickle       Danish        Avisdata CD-Rom Ver. 1.1. 1995      Berlingske Tidende            ~550,000                             Jan Strunk / Tibor Kiss | ||
                                  (Berlingske Avisdata, Copenhagen)   Weekend Avisen | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
dutch.pickle        Dutch         Multilingual Corpus 1 (ECI)         De Limburger                  ~340,000                             Jan Strunk / Tibor Kiss | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
english.pickle      English       Penn Treebank (LDC)                 Wall Street Journal           ~469,000                             Jan Strunk / Tibor Kiss | ||
                    (American) | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
estonian.pickle     Estonian      University of Tartu, Estonia        Eesti Ekspress                ~359,000                             Jan Strunk / Tibor Kiss | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
finnish.pickle      Finnish       Finnish Parole Corpus, Finnish      Books and major national      ~364,000                             Jan Strunk / Tibor Kiss | ||
                                  Text Bank (Suomen Kielen            newspapers | ||
                                  Tekstipankki) | ||
                                  Finnish Center for IT Science | ||
                                  (CSC) | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
french.pickle       French        Multilingual Corpus 1 (ECI)         Le Monde                      ~370,000                             Jan Strunk / Tibor Kiss | ||
                    (European) | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
german.pickle       German        Neue Zürcher Zeitung AG             Neue Zürcher Zeitung          ~847,000                             Jan Strunk / Tibor Kiss | ||
                                  (Switzerland)                       CD-ROM | ||
                                                                      (Uses "ss" | ||
                                                                      instead of "ß") | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
greek.pickle        Greek         Efstathios Stamatatos               To Vima (TO BHMA)             ~227,000                             Jan Strunk / Tibor Kiss | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
italian.pickle      Italian       Multilingual Corpus 1 (ECI)         La Stampa, Il Mattino         ~312,000                             Jan Strunk / Tibor Kiss | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
norwegian.pickle    Norwegian     Centre for Humanities               Bergens Tidende               ~479,000                             Jan Strunk / Tibor Kiss | ||
                    (Bokmål and   Information Technologies, | ||
                    Nynorsk)      Bergen | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
polish.pickle       Polish        Polish National Corpus              Literature, newspapers, etc.  ~1,000,000                           Krzysztof Langner | ||
                                  (http://www.nkjp.pl/) | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
portuguese.pickle   Portuguese    CETENFolha Corpus                   Folha de São Paulo            ~321,000                             Jan Strunk / Tibor Kiss | ||
                    (Brazilian)   (Linguateca) | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
slovene.pickle      Slovene       TRACTOR                             Delo                          ~354,000                             Jan Strunk / Tibor Kiss | ||
                                  Slovene Academy for Arts | ||
                                  and Sciences | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
spanish.pickle      Spanish       Multilingual Corpus 1 (ECI)         Sur                           ~353,000                             Jan Strunk / Tibor Kiss | ||
                    (European) | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
swedish.pickle      Swedish       Multilingual Corpus 1 (ECI)         Dagens Nyheter                ~339,000                             Jan Strunk / Tibor Kiss | ||
                                                                      (and some other texts) | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
turkish.pickle      Turkish       METU Turkish Corpus                 Milliyet                      ~333,000                             Jan Strunk / Tibor Kiss | ||
                                  (Türkçe Derlem Projesi) | ||
                                  University of Ankara | ||
---------------------------------------------------------------------------------------------------------------------------------------------------------------- | ||
|
||
The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to | ||
Unicode using the codecs module. | ||
|
||
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. | ||
Computational Linguistics 32: 485-525. | ||
|
||
---- Training Code ---- | ||
|
||
# import punkt | ||
import nltk.tokenize.punkt | ||
|
||
# Make a new Tokenizer | ||
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer() | ||
|
||
# Read in training corpus (one example: Slovene) | ||
import codecs | ||
text = codecs.open("slovene.plain","Ur","iso-8859-2").read() | ||
|
||
# Train tokenizer | ||
tokenizer.train(text) | ||
|
||
# Dump pickled tokenizer | ||
import pickle | ||
out = open("slovene.pickle","wb") | ||
pickle.dump(tokenizer, out) | ||
out.close() | ||
|
||
--------- |
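The training snippet in the README above is written in an older Python style: the `"Ur"` mode passed to `codecs.open` is rejected by recent Python 3 releases. A minimal Python 3 sketch of the same steps might look like the following. The function names `read_corpus` and `train_and_pickle` are hypothetical helpers introduced for illustration, not part of NLTK, and the sketch assumes that `PunktSentenceTokenizer.train` accepts the raw corpus text as it does in the README.

```python
import pickle


def read_corpus(path, encoding="iso-8859-2"):
    """Read a plain-text training corpus and normalise line endings."""
    # Text mode with newline=None applies universal-newline translation,
    # which is what the "U" flag in the README snippet requested.
    with open(path, "r", encoding=encoding, newline=None) as handle:
        return handle.read()


def train_and_pickle(corpus_path, model_path, encoding="iso-8859-2"):
    """Train a Punkt sentence tokenizer and pickle it to model_path."""
    # Imported lazily so read_corpus() stays usable without NLTK installed.
    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()
    tokenizer.train(read_corpus(corpus_path, encoding))
    # The context manager closes the file even if pickling fails.
    with open(model_path, "wb") as out:
        pickle.dump(tokenizer, out)
    return tokenizer
```

A call such as `train_and_pickle("slovene.plain", "slovene.pickle")` would mirror the Slovene example from the README, assuming the corpus file exists locally. Pickled models should only be loaded from trusted sources, since unpickling can execute arbitrary code.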
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,118 @@ | ||
t | ||
množ | ||
např | ||
j.h | ||
man | ||
ú | ||
jug | ||
dr | ||
bl | ||
ml | ||
okr | ||
st | ||
uh | ||
šp | ||
judr | ||
u.s.a | ||
p | ||
arg | ||
žitě | ||
st.celsia | ||
etc | ||
p.s | ||
t.r | ||
lok | ||
mil | ||
ict | ||
n | ||
tl | ||
min | ||
č | ||
d | ||
al | ||
ravenně | ||
mj | ||
nar | ||
plk | ||
s.p | ||
a.g | ||
roč | ||
b | ||
zdi | ||
r.s.c | ||
přek | ||
m | ||
gen | ||
csc | ||
mudr | ||
vic | ||
š | ||
sb | ||
resp | ||
tzn | ||
iv | ||
s.r.o | ||
mar | ||
w | ||
čs | ||
vi | ||
tzv | ||
ul | ||
pen | ||
zv | ||
str | ||
čp | ||
org | ||
rak | ||
sv | ||
pplk | ||
u.s | ||
prof | ||
c.k | ||
op | ||
g | ||
vii | ||
kr | ||
ing | ||
j.o | ||
drsc | ||
m3 | ||
l | ||
tr | ||
ceo | ||
ch | ||
fuk | ||
vl | ||
viii | ||
líp | ||
hl.m | ||
t.zv | ||
phdr | ||
o.k | ||
tis | ||
doc | ||
kl | ||
ard | ||
čkd | ||
pok | ||
apod | ||
r | ||
př | ||
a.s | ||
j | ||
jr | ||
i.m | ||
e | ||
kupř | ||
f | ||
tř | ||
xvi | ||
mir | ||
atď | ||
vr | ||
r.i.v | ||
hl | ||
kv | ||
t.j | ||
y | ||
q.p.r |
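The file above appears to list one abbreviation type per line, lower-cased and without the final period (for example `např` or `u.s.a`). As a rough sketch of how such a list could be consulted, assuming that storage convention holds for every entry, the following reads the list into a set and checks a token against it. Both function names are hypothetical and do not correspond to an NLTK API.

```python
def parse_abbrev_types(text):
    """Return the set of abbreviation types, one per non-empty line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def is_known_abbreviation(token, abbrev_types):
    """Check a token such as 'Např.' against the abbreviation set."""
    # Assumption: entries are lower-cased and stored without the final period,
    # so the token is normalised the same way before the lookup.
    return token.rstrip(".").lower() in abbrev_types


# A few entries copied from the diff above.
sample = "t\nmnož\nnapř\nu.s.a\nprof\n"
abbrevs = parse_abbrev_types(sample)
print(is_known_abbreviation("Např.", abbrevs))
print(is_known_abbreviation("U.S.A.", abbrevs))
```

A set is used because membership tests are the only operation needed and the order of entries in the file carries no meaning here.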
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
i dejmala | ||
##number## prosince | ||
h steina | ||
##number## listopadu | ||
a dvořák | ||
v klaus | ||
i čnhl | ||
##number## wladyslawowo | ||
##number## letech | ||
a jiráska | ||
a dubček | ||
##number## štrasburk | ||
##number## juniorské | ||
##number## století | ||
##number## kola | ||
##number## pád | ||
##number## května | ||
##number## týdne | ||
v dlouhý | ||
k design | ||
##number## červenec | ||
i ligy | ||
##number## kolo | ||
z svěrák | ||
##number## mája | ||
##number## šimková | ||
a bělého | ||
a bradáč | ||
##number## ročníku | ||
##number## dubna | ||
a vivaldiho | ||
v mečiara | ||
c carrićre | ||
##number## sjezd | ||
##number## výroční | ||
##number## kole | ||
##number## narozenin | ||
k maleevová | ||
i čnfl | ||
##number## pádě | ||
##number## září | ||
##number## výročí | ||
a dvořáka | ||
h g. | ||
##number## ledna | ||
a dvorský | ||
h měsíc | ||
##number## srpna | ||
##number## tř. | ||
a mozarta | ||
##number## sudetoněmeckých | ||
o sokolov | ||
k škrach | ||
v benda | ||
##number## symfonie | ||
##number## července | ||
x šalda | ||
c abrahama | ||
a tichý | ||
##number## místo | ||
k bielecki | ||
v havel | ||
##number## etapu | ||
a dubčeka | ||
i liga | ||
##number## světový | ||
v klausem | ||
##number## ženy | ||
##number## létech | ||
##number## minutě | ||
##number## listopadem | ||
##number## místě | ||
o vlček | ||
k peteraje | ||
i sponzor | ||
##number## června | ||
##number## min. | ||
##number## oprávněnou | ||
##number## květnu | ||
##number## aktu | ||
##number## květnem | ||
##number## října | ||
i rynda | ||
##number## února | ||
i snfl | ||
a mozart | ||
z košler | ||
a dvorskému | ||
v marhoul | ||
v mečiar | ||
##number## ročník | ||
##number## máje | ||
v havla | ||
k gott | ||
s bacha | ||
##number## ad |
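Each row above looks like a pair of token types that tend to occur together across a period, with `##number##` standing in for any numeric token (so `##number## listopadu` would cover dates such as "17. listopadu"). The sketch below shows one way such pairs could be parsed and queried. It assumes the two fields are separated by whitespace, and its normalisation is a simplified approximation rather than Punkt's actual type logic. The function names are hypothetical.

```python
NUMBER_TYPE = "##number##"


def parse_collocations(text):
    """Parse lines of two whitespace-separated types into a set of pairs."""
    pairs = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            pairs.add((parts[0], parts[1]))
    return pairs


def is_collocation(first, second, collocations):
    """Check whether 'first. second' matches a known collocation pair."""
    # Simplified normalisation: drop the trailing period, lower-case,
    # and map purely numeric tokens onto the ##number## placeholder.
    left = first.rstrip(".").lower()
    if left.isdigit():
        left = NUMBER_TYPE
    return (left, second.lower()) in collocations


# A few entries copied from the diff above.
sample = "##number##\tlistopadu\nv\thavel\na\tdvořák\n"
collocations = parse_collocations(sample)
print(is_collocation("17.", "listopadu", collocations))
print(is_collocation("V.", "Havel", collocations))
```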