Python library for managing common stop words in 39 languages.
Contents
Better than a long speech, here a direct introduction:
>>> from mots_vides import stop_words >>> english_stop_words = stop_words('en') >>> text = """ ... Even though using "lorem ipsum" often arouses curiosity ... due to its resemblance to classical Latin, ... it is not intended to have meaning. ... """ >>> print(english_stop_words.rebase(text)) XXXX XXXXXX XXXXX "lorem ipsum" XXXXX arouses curiosity XXX XX XXX resemblance XX classical Latin, XX XX XXX intended XX XXXX meaning. >>> print(english_stop_words.rebase(text, '').split()) ['"lorem', 'ipsum"', 'arouses', 'curiosity', 'resemblance', 'classical', 'Latin,', 'intended', 'meaning.']
Mots vides also provides two classes for managing the stop words in your language.
StopWord
which is a container for a collection of stop words.
By default is language agnostic, but can be easily manipulated to create
the collection:
>>> from mots_vides import StopWord >>> french_stop_words = StopWord('french', ['le', 'la', 'les']) >>> french_stop_words += StopWord('french', ['un', 'une', 'des']) >>> french_stop_words += ['or', 'ni', 'car'] >>> french_stop_words += 'assez' >>> french_stop_words += u'aussitôt' >>> print(sorted(french_stop_words)) ['assez', u'aussitôt', 'car', 'des', 'la', 'le', 'les', 'ni', 'or', 'un', 'une']
StopWordFactory
is a factory for initializing StopWord
objects by
language and the appropriate collection of stop words.
>>> from mots_vides import StopWordFactory >>> factory = StopWordFactory() >>> french_stop_words = factory.get_stop_words('french') >>> print(len(french_stop_words)) 577
You can also use international language code to query a collection:
>>> french_stop_words = factory.get_stop_words('fr') >>> print(len(french_stop_words)) 577
If the required language does not exist a StopWordError
is raised,
unless the fail_safe
parameter is set to True
:
>>> klingon_stop_words = factory.get_stop_words('klingon') StopWordError: Stop words are not available in "klingon". >>> klingon_stop_words = factory.get_stop_words('klingon', fail_safe=True) >>> print(len(klingon_stop_words)) 0
- Arabic
- Armenian
- Basque
- Bengali
- Bulgarian
- Catalan
- Chinese
- Czech
- Danish
- Dutch
- English
- Finnish
- French
- Galician
- German
- Greek
- Hindi
- Hungarian
- Indonesian
- Irish
- Italian
- Japanese
- Korean
- Latvian
- Lithuanian
- Marathi
- Norwegian
- Persian
- Polish
- Portuguese
- Romanian
- Russian
- Slovak
- Spanish
- Swedish
- Thai
- Turkish
- Ukrainian
- Urdu
Tested with Python 2.6, 2.7, 3.2, 3.3, 3.4.
Mots vides means stop words in french.
Inspired from https://github.com/Alir3z4/python-stop-words
- Fix cache system for Python 3
- Fix potential issue in factory.get_available_languages
- Fix packaging
- Add a rebaser command script
- Initial release
- Development release