- Auto-generated stopwords for South African Languages
- We present a list of stopwords that were auto-translated from English and adapted to native South African Bantu languages.
- The data is provided in JSON Lines format. Here is an example of using the stopwords in Python:
import json
# Load the stop words from the JSON lines file
stop_words = []
with open('za_stopwords.main.jsonl', 'r', encoding='utf-8') as file:
    for line in file:
        stop_words.append(json.loads(line.strip()))
# Example: Print stop words in Zulu
for word in stop_words:
    print(f"English: {word['eng']}, Zulu: {word['zul']}")
Refer to training_example.py for a full working example of how you can use these stopwords in your model training scripts.
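For quick experiments, the stopwords for a single language can be collected into a set and used to filter tokens. The sketch below is illustrative only and is not taken from training_example.py. It assumes every record is a JSON object keyed by language code (as the `eng` and `zul` keys above suggest) and that a missing or empty value means no translation is available. The helper names `load_stopword_set` and `remove_stopwords` are hypothetical.

```python
import json
from pathlib import Path


def load_stopword_set(path, lang):
    """Return the lowercased stopwords stored under `lang` in a JSON Lines file.

    Assumption: each line is a JSON object keyed by language code,
    for example 'eng' or 'zul'.
    """
    stopwords = set()
    with Path(path).open("r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the file
            record = json.loads(line)
            word = record.get(lang)
            # Assumption: a missing or empty value means "no translation".
            if isinstance(word, str) and word.strip():
                stopwords.add(word.strip().lower())
    return stopwords


def remove_stopwords(text, stopwords):
    """Drop stopwords from whitespace-tokenised text (hypothetical helper)."""
    return [token for token in text.split() if token.lower() not in stopwords]
```

A typical call would be `zul = load_stopword_set('za_stopwords.main.jsonl', 'zul')` followed by `remove_stopwords(sentence, zul)`. Tokenisation here is a plain whitespace split, so punctuation attached to a word will prevent a match; a real pipeline would likely use its own tokenizer.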
- ven - Tshivenda
- tso - Xitsonga
- sot - Southern Sotho
- nso - Northern Sotho
- tsn - Setswana
- zul - IsiZulu
- xho - IsiXhosa
- ...
- Coming soon:
- nbl - IsiNdebele
- ssw - Siswati
- Feel free to create a pull request if you have a suggestion
- This work is published under the CC BY-NC 4.0 license
@misc{multilingual_stop_words,
author = {Ndamulelo Nemakhavhani},
title = {Autogenerated Stop Words for South African Bantu Languages},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ndamulelonemakh/our-stopwords}}
}
- If you have any questions or suggestions, please feel free to open an issue or contact us.