We use the Dataset for Hate Speech Detection in Indonesian (https://github.com/ialfina/id-hatespeech-detection).
The dataset has two columns, label and tweet, and contains 713 tweets in Indonesian. The labels are:
- Non_HS for "non-hate-speech" tweets (453).
- HS for "hate-speech" tweets (260).
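As a minimal sketch of how the two columns and the label counts can be inspected with pandas: the tab-separated layout, the header row, and the three sample rows below are assumptions made for illustration, not the contents of the real file.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the real dataset has 713 rows.
# The tab-separated layout and header row are assumptions.
sample = (
    "Label\tTweet\n"
    "Non_HS\tselamat pagi semua\n"
    "HS\tcontoh tweet kasar\n"
    "Non_HS\tcuaca cerah hari ini\n"
)

# header=0 skips the file's own header; names= assigns our column names.
df = pd.read_csv(io.StringIO(sample), sep="\t", header=0, names=["label", "tweet"])

# Count how many tweets carry each label.
label_counts = df["label"].value_counts().to_dict()
print(label_counts)
```

On the full dataset, the same `value_counts()` call should report the 453 / 260 split listed above.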
- Python 3.7 or above
- Modules:
  - pandas
  - numpy
  - seaborn
  - matplotlib
  - re
  - TextBlob
  - nltk (stopwords)
  - Sastrawi (StemmerFactory)
  - sklearn (train_test_split)
Histogram: the distribution of tweet text lengths.
Since the dataset is imbalanced (453 Non_HS vs. 260 HS tweets), we over-sample to create a balanced dataset.
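The over-sampling technique is not specified here. One common choice is random over-sampling, which duplicates randomly chosen minority-class rows until every class matches the largest one. The sketch below assumes that approach; `oversample_minority`, the `seed` parameter, and the placeholder tweets are hypothetical.

```python
import pandas as pd


def oversample_minority(df, label_col="label", seed=42):
    """Randomly duplicate rows of smaller classes until all classes match the largest."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = []
    for label, n in counts.items():
        group = df[df[label_col] == label]
        if n < target:
            # Draw the missing rows with replacement from the same class.
            extra = group.sample(n=target - n, replace=True, random_state=seed)
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so the classes are interleaved, then reset the index.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Placeholder frame with the same class sizes as the dataset (453 vs. 260).
df = pd.DataFrame({
    "label": ["Non_HS"] * 453 + ["HS"] * 260,
    "tweet": [f"tweet {i}" for i in range(713)],
})
balanced = oversample_minority(df)
print(balanced["label"].value_counts())
```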
Text preprocessing consists of three steps:
- Tokenizing: split each tweet into a list of words and remove punctuation.
- Filtering: remove stopwords and words containing unusual symbols.
- Stemming: reduce each Indonesian word in the tweet to its base form.
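The three steps above can be sketched with the standard library alone. The stopword set here is a tiny illustrative stand-in (nltk's `stopwords.words("indonesian")` provides a fuller list), and the stemmer is passed in as a callable so the sketch runs without Sastrawi installed. The function names are hypothetical.

```python
import re

# Tiny illustrative stopword set (an assumption, not the list used in this project).
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}


def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, and split on whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return text.split()


def filter_tokens(tokens, stopwords=STOPWORDS):
    """Keep purely alphabetic tokens that are not stopwords."""
    return [t for t in tokens if t.isalpha() and t not in stopwords]


def preprocess(text, stem=None):
    """Tokenize, filter, and optionally stem each remaining token."""
    tokens = filter_tokens(tokenize(text))
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens


print(preprocess("Mereka pergi ke pasar, dan membeli buah2an!"))
# → ['mereka', 'pergi', 'pasar', 'membeli']
```

With Sastrawi installed, a stemmer can be created with `StemmerFactory().create_stemmer()` and its `stem` method passed as the `stem` argument.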
For splitting data, we use train_test_split:
- Train data = 80%
- Test data = 20%
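A minimal sketch of the 80/20 split with scikit-learn's `train_test_split`. The placeholder texts, the fixed `random_state`, and the `stratify` argument are assumptions added for reproducibility and class balance. They are not stated in this README.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 100 texts with alternating labels.
texts = [f"tweet {i}" for i in range(100)]
labels = ["HS" if i % 2 else "Non_HS" for i in range(100)]

# test_size=0.2 keeps 80% for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42,   # assumption: fixed seed for repeatable splits
    stratify=labels,   # assumption: keep the label ratio equal in both parts
)

print(len(X_train), len(X_test))
```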
We use three classifiers:
- Random Forest
- Multinomial Naive Bayes
- K Nearest Neighbors
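The three classifiers can be set up in scikit-learn as sketched below. The feature extraction step is not described here, so the TF-IDF vectorizer is an assumption, as are the hyperparameters, the `build_models` helper, and the four made-up training sentences.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def build_models(seed=42):
    """Return one text-classification pipeline per classifier (TF-IDF is an assumption)."""
    return {
        "Random Forest": make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=seed)),
        "Multinomial NB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
        "KNN": make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3)),
    }


# Made-up toy sentences, only to show the fit/predict flow.
train_texts = ["aku suka kamu", "kamu baik sekali", "aku benci kamu", "kamu jahat sekali"]
train_labels = ["Non_HS", "Non_HS", "HS", "HS"]

models = build_models()
predictions = {}
for name, model in models.items():
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(["kamu baik"])[0]

print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline ensures the vocabulary is learned from the training data only.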
Classifier | Macro F1 | Accuracy | Recall |
---|---|---|---|
Random Forest | 0.93 | 0.93 | 0.94 |
Multinomial NB | 0.85 | 0.85 | 0.86 |
KNN | 0.81 | 0.81 | 0.83 |
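The three metrics in the table can be computed with `sklearn.metrics`, as in the sketch below. The toy labels are made up, and macro averaging for recall is an assumption, since the table does not say how recall was averaged.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up labels for illustration only.
y_true = ["HS", "HS", "Non_HS", "Non_HS", "Non_HS"]
y_pred = ["HS", "Non_HS", "Non_HS", "Non_HS", "HS"]

# Macro averaging computes the metric per class, then takes the unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")
accuracy = accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")  # assumption: macro average

print(f"Macro F1: {macro_f1:.2f}  Accuracy: {accuracy:.2f}  Recall: {macro_recall:.2f}")
```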
Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata, "Hate Speech Detection in Indonesian Language: A Dataset and Preliminary Study", in Proceedings of the 9th International Conference on Advanced Computer Science and Information Systems 2017 (ICACSIS 2017).