A Julia package containing a collection of stop words for multiple languages.

guo-yong-zhi/StopWords.jl

StopWords.jl

Stop words are words, collected in a negative dictionary, that are filtered out before or after processing natural language text because they carry little meaning on their own. This Julia package provides stop word lists for multiple languages. The data is sourced from https://github.com/stopwords-iso/stopwords-iso. The package currently supports 57 languages, identified by their ISO 639-3 codes:

afr ara ben bre bul cat ces dan deu ell eng epo est eus fas fin fra gle glg guj hau hbs heb hin hun hye ita jpn kor kur lat lav lit mar msa nld nor pol por ron rus slk slv som sot spa swa swe tgl tha tur ukr urd vie yor zho zul

Installation

import Pkg; Pkg.add("StopWords")

Usage

The stopwords variable is the only exported symbol of this package. It can be regarded as a lazy dictionary of stop words for multiple languages. You can access the stop words for a given language using the language name or ISO 639 code. For example, to get the stop words for English, you can use stopwords["eng"], stopwords["en"], or stopwords["English"].

julia> using StopWords
julia> stopwords["eng"]
Set{String} with 1298 elements:
  "nu"
  "youd"
  "whoever"
  "shouldn"
  "null"
  "everywhere"
  ⋮
julia> stopwords["eng"] === stopwords["en"] === stopwords["English"]
true
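
A common use of the lookup above is dropping stop words from tokenized text. The following is a minimal sketch, not part of the package: `remove_stopwords` is a hypothetical helper, and the whitespace tokenization is a simplifying assumption (real text usually needs punctuation handling as well).

```julia
using StopWords

# Hypothetical helper (not provided by StopWords.jl): remove stop words from a text.
function remove_stopwords(text::AbstractString, lang="eng")
    sw = stopwords[lang]                      # a Set{String}, as shown in the output above
    tokens = String.(split(lowercase(text)))  # naive whitespace tokenization (assumption)
    return [t for t in tokens if !(t in sw)]  # keep only tokens that are not stop words
end

remove_stopwords("The cat sat on the mat")
```

Because the lookup returns a `Set`, each membership test is a hash lookup, so filtering stays cheap even for the larger word lists.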

You can also get the stop words for multiple languages at once.

julia> stopwords[["eng", "fra"]]
Set{String} with 1922 elements:
  "nu"
  "youd"
  "ont"
  "pfut"
  "whoever"
  "shouldn"
  "enfin"
  "tac"
  ⋮
julia> stopwords[["eng", "fra"]] === stopwords[("eng", "fra")] == stopwords["eng"] ∪ stopwords["fra"]
true

You can also get the stop words for all languages at once.

julia> stopwords[:] === stopwords[] === stopwords[StopWords.supported_languages()]
true

The StopWords.supported_languages() function returns a set of all the languages currently supported by the package. To check whether a specific language is supported, use the haskey function. To check several languages at once, pass a list to haskey; as the examples show, it returns true only if every language in the list is supported.

julia> haskey(stopwords, "eng")
true
julia> haskey(stopwords, ["English", "fra"])
true
julia> haskey(stopwords, ["English", "foo"])
false
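
The `haskey` check can guard a lookup when the language comes from user input. This is a small sketch under assumptions: `stopwords_or_empty` is a hypothetical helper, and falling back to an empty set is one possible policy rather than something the package prescribes.

```julia
using StopWords

# Hypothetical helper (not provided by StopWords.jl): return the stop words for
# `lang`, or an empty set when the language is not supported.
function stopwords_or_empty(lang)
    return haskey(stopwords, lang) ? stopwords[lang] : Set{String}()
end

isempty(stopwords_or_empty("foo"))  # "foo" is assumed unsupported, consistent with the haskey output above
length(stopwords_or_empty("eng"))   # size of the English stop word set
```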
