feat:integrate nltk data into docker image
christinestraub committed Jan 2, 2025
1 parent 9d1a57d commit 27eea08
Showing 77 changed files with 778,808 additions and 2 deletions.
7 changes: 5 additions & 2 deletions Dockerfile
@@ -9,6 +9,9 @@ COPY unstructured unstructured
COPY test_unstructured test_unstructured
COPY example-docs example-docs

+# Copy the pre-downloaded NLTK data folder into the image.
+COPY ./nltk_data /home/notebook-user/nltk_data

RUN chown -R notebook-user:notebook-user /app && \
apk add font-ubuntu git && \
fc-cache -fv && \
@@ -18,8 +21,8 @@ USER notebook-user

RUN find requirements/ -type f -name "*.txt" -exec pip3.11 install --no-cache-dir --user -r '{}' ';'

-RUN python3.11 -c "import os; os.makedirs('/home/notebook-user/nltk_data', exist_ok=True)" && \
-python3.11 -c "from nltk.downloader import download; download('punkt_tab'); download('averaged_perceptron_tagger_eng')"
+# Command to check if NLTK data has been copied correctly
+RUN python3.11 -c "import nltk; print(nltk.data.find('tokenizers/punkt_tab'))"

RUN python3.11 -c "from unstructured.partition.model_init import initialize; initialize()" && \
python3.11 -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"
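
The build-time check in this diff only looks up `tokenizers/punkt_tab`, while the download step it replaces also fetched `averaged_perceptron_tagger_eng`. A minimal sketch of a stricter check, using only the standard library, might look like the following. The function name `missing_nltk_resources` is hypothetical, and the `taggers/averaged_perceptron_tagger_eng` subpath is an assumption based on NLTK's usual data layout rather than something this commit confirms.

```python
from pathlib import Path

# Resource subpaths expected under the copied nltk_data folder.
# The tokenizer path comes from the Dockerfile check; the tagger path is an
# assumption about NLTK's layout and is not confirmed by the commit itself.
EXPECTED_RESOURCES = (
    "tokenizers/punkt_tab",
    "taggers/averaged_perceptron_tagger_eng",
)


def missing_nltk_resources(root, expected=EXPECTED_RESOURCES):
    """Return the expected resource subpaths that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]
```

Calling `missing_nltk_resources("/home/notebook-user/nltk_data")` inside the image would return an empty list when both folders were copied, so a build step could fail whenever the list is non-empty.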
@@ -0,0 +1 @@
[".", "(", ")", ":", "''", "EX", "JJS", "WRB", "VBG", "VBP", "NN", "SYM", "VB", "UH", "NNPS", "NNP", "``", "$", "NNS", "JJR", "MD", "RP", "VBD", "DT", "POS", "RBR", ",", "VBZ", "PDT", "VBN", "WP$", "WDT", "WP", "PRP$", "CD", "IN", "#", "CC", "RB", "FW", "RBS", "PRP", "LS", "JJ", "TO"]

Large diffs are not rendered by default.

98 changes: 98 additions & 0 deletions nltk_data/tokenizers/punkt_tab/README
@@ -0,0 +1,98 @@
Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)

Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
been contributed by various people using NLTK for sentence boundary detection.

For information about how to use these models, please consult the tokenization HOWTO:
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
and chapter 3.8 of the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation

There are pretrained tokenizers for the following languages:

File               Language      Source                             Contents                      Size of training corpus (in tokens)  Model contributed by
=======================================================================================================================================================================
czech.pickle       Czech         Multilingual Corpus 1 (ECI)        Lidove Noviny                 ~345,000                             Jan Strunk / Tibor Kiss
                                                                    Literarni Noviny
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
danish.pickle      Danish        Avisdata CD-Rom Ver. 1.1. 1995     Berlingske Tidende            ~550,000                             Jan Strunk / Tibor Kiss
                                 (Berlingske Avisdata, Copenhagen)  Weekend Avisen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
dutch.pickle       Dutch         Multilingual Corpus 1 (ECI)        De Limburger                  ~340,000                             Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
english.pickle     English       Penn Treebank (LDC)                Wall Street Journal           ~469,000                             Jan Strunk / Tibor Kiss
                   (American)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
estonian.pickle    Estonian      University of Tartu, Estonia       Eesti Ekspress                ~359,000                             Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
finnish.pickle     Finnish       Finnish Parole Corpus, Finnish     Books and major national      ~364,000                             Jan Strunk / Tibor Kiss
                                 Text Bank (Suomen Kielen           newspapers
                                 Tekstipankki)
                                 Finnish Center for IT Science
                                 (CSC)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
french.pickle      French        Multilingual Corpus 1 (ECI)        Le Monde                      ~370,000                             Jan Strunk / Tibor Kiss
                   (European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
german.pickle      German        Neue Zürcher Zeitung AG            Neue Zürcher Zeitung          ~847,000                             Jan Strunk / Tibor Kiss
                                 (Switzerland)                      CD-ROM
                                                                    (Uses "ss"
                                                                    instead of "ß")
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
greek.pickle       Greek         Efstathios Stamatatos              To Vima (TO BHMA)             ~227,000                             Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
italian.pickle     Italian       Multilingual Corpus 1 (ECI)        La Stampa, Il Mattino         ~312,000                             Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
norwegian.pickle   Norwegian     Centre for Humanities              Bergens Tidende               ~479,000                             Jan Strunk / Tibor Kiss
                   (Bokmål and   and Information Technologies,
                   Nynorsk)      Bergen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
polish.pickle      Polish        Polish National Corpus             Literature, newspapers, etc.  ~1,000,000                           Krzysztof Langner
                                 (http://www.nkjp.pl/)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
portuguese.pickle  Portuguese    CETENFolha Corpus                  Folha de São Paulo            ~321,000                             Jan Strunk / Tibor Kiss
                   (Brazilian)   (Linguateca)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
slovene.pickle     Slovene       TRACTOR                            Delo                          ~354,000                             Jan Strunk / Tibor Kiss
                                 Slovene Academy for Arts
                                 and Sciences
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
spanish.pickle     Spanish       Multilingual Corpus 1 (ECI)        Sur                           ~353,000                             Jan Strunk / Tibor Kiss
                   (European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
swedish.pickle     Swedish       Multilingual Corpus 1 (ECI)        Dagens Nyheter                ~339,000                             Jan Strunk / Tibor Kiss
                                                                    (and some other texts)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
turkish.pickle     Turkish       METU Turkish Corpus                Milliyet                      ~333,000                             Jan Strunk / Tibor Kiss
                                 (Türkçe Derlem Projesi)
                                 University of Ankara
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
Unicode using the codecs module.

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
Computational Linguistics 32: 485-525.

---- Training Code ----

# Import the Punkt module
import nltk.tokenize.punkt

# Make a new tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in the training corpus (one example: Slovene).
# Mode "r" replaces the old "Ur" mode, which current Python 3 releases reject.
import codecs
with codecs.open("slovene.plain", "r", "iso-8859-2") as f:
    text = f.read()

# Train the tokenizer
tokenizer.train(text)

# Dump the pickled tokenizer; the with-block closes the file
import pickle
with open("slovene.pickle", "wb") as out:
    pickle.dump(tokenizer, out)

---------
118 changes: 118 additions & 0 deletions nltk_data/tokenizers/punkt_tab/czech/abbrev_types.txt
@@ -0,0 +1,118 @@
t
množ
např
j.h
man
ú
jug
dr
bl
ml
okr
st
uh
šp
judr
u.s.a
p
arg
žitě
st.celsia
etc
p.s
t.r
lok
mil
ict
n
tl
min
č
d
al
ravenně
mj
nar
plk
s.p
a.g
roč
b
zdi
r.s.c
přek
m
gen
csc
mudr
vic
š
sb
resp
tzn
iv
s.r.o
mar
w
čs
vi
tzv
ul
pen
zv
str
čp
org
rak
sv
pplk
u.s
prof
c.k
op
g
vii
kr
ing
j.o
drsc
m3
l
tr
ceo
ch
fuk
vl
viii
líp
hl.m
t.zv
phdr
o.k
tis
doc
kl
ard
čkd
pok
apod
r
a.s
j
jr
i.m
e
kupř
f
xvi
mir
atď
vr
r.i.v
hl
kv
t.j
y
q.p.r
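
Each line of `abbrev_types.txt` above holds one lowercased abbreviation without its trailing period. As a rough sketch of how such a list could be consulted, the snippet below parses the file content into a set and checks whether a period-terminated token is a listed abbreviation. The helper names `parse_abbrev_types` and `is_known_abbreviation` are hypothetical, and this is a simplified illustration rather than NLTK's Punkt logic, which applies further heuristics.

```python
def parse_abbrev_types(text):
    """Parse abbrev_types.txt content: one abbreviation per line, no trailing period."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def is_known_abbreviation(token, abbrev_types):
    """Return True if token (e.g. 'např.') is a listed abbreviation followed by a period."""
    if not token.endswith("."):
        return False
    # Entries are stored lowercased, so compare case-insensitively.
    return token[:-1].lower() in abbrev_types


# A few entries copied from the Czech list above.
sample = "např\ndr\ntzv\ns.r.o\n"
abbrevs = parse_abbrev_types(sample)
```

With this sample, `is_known_abbreviation("Dr.", abbrevs)` is true, so a period after `Dr` would not be treated as a sentence boundary on the basis of this list alone.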
96 changes: 96 additions & 0 deletions nltk_data/tokenizers/punkt_tab/czech/collocations.tab
@@ -0,0 +1,96 @@
i dejmala
##number## prosince
h steina
##number## listopadu
a dvořák
v klaus
i čnhl
##number## wladyslawowo
##number## letech
a jiráska
a dubček
##number## štrasburk
##number## juniorské
##number## století
##number## kola
##number## pád
##number## května
##number## týdne
v dlouhý
k design
##number## červenec
i ligy
##number## kolo
z svěrák
##number## mája
##number## šimková
a bělého
a bradáč
##number## ročníku
##number## dubna
a vivaldiho
v mečiara
c carrićre
##number## sjezd
##number## výroční
##number## kole
##number## narozenin
k maleevová
i čnfl
##number## pádě
##number## září
##number## výročí
a dvořáka
h g.
##number## ledna
a dvorský
h měsíc
##number## srpna
##number## tř.
a mozarta
##number## sudetoněmeckých
o sokolov
k škrach
v benda
##number## symfonie
##number## července
x šalda
c abrahama
a tichý
##number## místo
k bielecki
v havel
##number## etapu
a dubčeka
i liga
##number## světový
v klausem
##number## ženy
##number## létech
##number## minutě
##number## listopadem
##number## místě
o vlček
k peteraje
i sponzor
##number## června
##number## min.
##number## oprávněnou
##number## květnu
##number## aktu
##number## květnem
##number## října
i rynda
##number## února
i snfl
a mozart
z košler
a dvorskému
v marhoul
v mečiar
##number## ročník
##number## máje
v havla
k gott
s bacha
##number## ad
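
Each row of `collocations.tab` above pairs a token that precedes a period with the token that follows it; `##number##` is the placeholder Punkt uses for numeric tokens. A minimal sketch of reading such rows into a set of pairs follows. It assumes the two columns are separated by a tab, which the `.tab` extension suggests but the rendered diff does not show, and the function name `parse_collocations` is hypothetical.

```python
def parse_collocations(text):
    """Parse collocations.tab content into a set of (first, second) token pairs.

    Assumes one pair per line with a tab between the two tokens.
    Lines without a tab or with an empty column are skipped.
    """
    pairs = set()
    for line in text.splitlines():
        first, sep, second = line.partition("\t")
        if not sep or not first or not second:
            continue
        pairs.add((first, second))
    return pairs


# A few rows copied from the Czech list above, written with explicit tabs.
sample = "##number##\tprosince\nv\thavel\na\tdvořák\n"
collocations = parse_collocations(sample)
```

A pair such as `("##number##", "prosince")` then records that a number followed by a period and `prosince` (as in a date) usually belongs to one sentence.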