Newspaper4k supports a large number of languages

The most important resource Newspaper4k needs to support a language is a stop words list: a list of words that are not considered important for analyzing the text. For example, the word "the" is a stop word in English. Newspaper4k uses the stop words list to detect the blocks of text that belong to the target article and to distinguish them from the surrounding boilerplate.
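To make the idea concrete, here is a minimal sketch of how counting stop words can separate running prose from navigation or sharing widgets. It is not Newspaper4k's actual implementation: the tiny `ENGLISH_STOPWORDS` set, the helper names `stopword_count` and `pick_article_block`, and the sample blocks are all assumptions made for illustration.

```python
import re

# Illustrative subset only; a real stop words list is much longer.
ENGLISH_STOPWORDS = {
    "the", "a", "an", "of", "to", "in", "and",
    "is", "that", "it", "for", "on", "was", "with",
}


def stopword_count(text, stopwords):
    """Count how many word tokens in `text` are stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token in stopwords)


def pick_article_block(blocks, stopwords):
    """Return the block with the most stop words (hypothetical heuristic)."""
    return max(blocks, key=lambda block: stopword_count(block, stopwords))


blocks = [
    "Home | News | Sport | Login",
    "The council voted on Tuesday to approve the budget, and it is "
    "expected that the plan will take effect in the spring.",
    "Share Tweet Subscribe",
]

best = pick_article_block(blocks, ENGLISH_STOPWORDS)
print(best)
```

Menus and button labels rarely contain words such as "the" or "and", so they score close to zero, while ordinary sentences score high. This is why a language without a stop words list is hard to support with this kind of heuristic.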

Check the table below for the full list of supported languages.

Your available languages are:

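| Input code | Full name | Has specialized tokenizer |
| --- | --- | --- |
| af | Afrikaans | |
| ar | Arabic | ✔️ |
| as | Assamese | |
| be | Belarusian | |
| bg | Bulgarian | |
| bn | Bengali | ✔️ |
| br | Breton | |
| ca | Catalan; Valencian | |
| cs | Czech | |
| cy | Welsh | |
| da | Danish | |
| de | German | |
| el | Greek, Modern (1453-) | |
| en | English | |
| eo | Esperanto | |
| es | Spanish; Castilian | |
| et | Estonian | |
| eu | Basque | |
| fa | Persian | |
| fi | Finnish | |
| fr | French | |
| ga | Irish | |
| gl | Galician | |
| gu | Gujarati | |
| ha | Hausa | |
| he | Hebrew | |
| hi | Hindi | ✔️ |
| hr | Croatian | |
| hu | Hungarian | |
| hy | Armenian | |
| id | Indonesian | |
| is | Icelandic | |
| it | Italian | |
| ja | Japanese | ✔️ |
| ka | Georgian | |
| kn | Kannada | |
| ko | Korean | ✔️ |
| ku | Kurdish | |
| lb | Luxembourgish; Letzeburgesch | |
| lt | Lithuanian | |
| lv | Latvian | |
| mk | Macedonian | |
| mn | Mongolian | |
| mr | Marathi | |
| ms | Malay | |
| my | Burmese | |
| nb | Bokmål, Norwegian; Norwegian Bokmål | |
| ne | Nepali | ✔️ |
| nl | Dutch; Flemish | |
| no | Norwegian | |
| pa | Panjabi; Punjabi | |
| pl | Polish | |
| ps | Pushto; Pashto | |
| pt | Portuguese | |
| rn | Rundi | |
| ro | Romanian; Moldavian; Moldovan | |
| ru | Russian | |
| rw | Kinyarwanda | |
| si | Sinhala; Sinhalese | |
| sk | Slovak | |
| sl | Slovenian | |
| so | Somali | |
| sr | Serbian | |
| st | Sotho, Southern | |
| sv | Swedish | |
| sw | Swahili | |
| ta | Tamil | ✔️ |
| te | Telugu | |
| th | Thai | ✔️ |
| tl | Tagalog | |
| tr | Turkish | |
| tt | Tatar | |
| uk | Ukrainian | |
| ur | Urdu | |
| uz | Uzbek | |
| vi | Vietnamese | |
| yo | Yoruba | |
| zh | Chinese | ✔️ |
| zu | Zulu | |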
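
The "Has specialized tokenizer" column matters because stop word counting only works once the text has been split into words. The sketch below shows one reason a language might need its own tokenizer: Chinese is written without spaces between words, so a plain whitespace split finds no stop words at all. The `CHINESE_STOPWORDS` set and the per-character `character_tokens` helper are assumptions for illustration only; a real tokenizer would use proper word segmentation, and this is not a description of how Newspaper4k tokenizes any of the languages marked above.

```python
# Tiny illustrative stop word set; not the list shipped with any library.
CHINESE_STOPWORDS = {"我", "的", "是"}


def whitespace_tokens(text):
    """Split on whitespace, as works for space-delimited languages."""
    return text.split()


def character_tokens(text):
    """Crude stand-in for a real segmenter: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def count_stopwords(tokens, stopwords):
    """Count tokens that appear in the stop word set."""
    return sum(1 for token in tokens if token in stopwords)


sentence = "我的书是新的"  # roughly: "My book is new"

naive = count_stopwords(whitespace_tokens(sentence), CHINESE_STOPWORDS)
segmented = count_stopwords(character_tokens(sentence), CHINESE_STOPWORDS)

print(naive, segmented)
```

With whitespace splitting the whole sentence is a single token, so no stop word is matched; once the text is segmented, the stop words become visible to the counting heuristic.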