Update on April 27th, 2023: We acknowledge that the Twitter API now has limited free access. Please contact Cagri Toraman (cagritoraman@gmail.com) if you have difficulties fetching the data from the Twitter API.
English hate speech detection model fine-tuned on Dataset v2:
https://huggingface.co/ctoraman/hate-speech-bert
Turkish hate speech detection model fine-tuned on Dataset v2:
https://huggingface.co/ctoraman/hate-speech-berturk
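A minimal sketch of running one of these checkpoints with the Hugging Face transformers pipeline is below. The mapping of the reported labels to Normal/Offensive/Hate is an assumption here (the checkpoints may expose generic names such as LABEL_0/LABEL_1/LABEL_2); check the model cards for the exact label names.

```python
# Minimal sketch: text classification with the released checkpoints.
# The pipeline labels are assumed to follow the dataset encoding
# (0: Normal, 1: Offensive, 2: Hate); the checkpoint may report
# generic names such as LABEL_0 / LABEL_1 / LABEL_2 instead.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="ctoraman/hate-speech-bert",  # English; use ctoraman/hate-speech-berturk for Turkish
)

print(classifier("some example tweet text"))
# e.g. [{'label': 'LABEL_0', 'score': 0.98}]
```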
This repository contains the dataset used in the LREC 2022 paper "Large-Scale Hate Speech Detection with Cross-Domain Transfer". The study mainly focuses on hate speech detection in Turkish and English. In addition, the success of domain transfer between hate domains is examined.
There are two dataset versions.
Dataset v1: The original dataset published with the LREC 2022 paper. It includes 100,000 tweets each for English and Turkish; annotations with more than 60% agreement are included.
Dataset v2: A more reliable dataset version that includes 68,597 tweets for English and 60,310 for Turkish; annotations with more than 80% agreement are included.
We acknowledge that some annotations in the original dataset (v1) are controversial. Therefore, we publish a more reliable dataset version (v2) that includes only the tweets with more than 80% annotator agreement. Dataset v2 has 128,907 tweets in total: 60,310 Turkish and 68,597 English. Explanations of the columns of the file are as follows:
Column Name | Description |
---|---|
TweetID | Twitter ID of the tweet |
LangID | Language of the tweet (0: Turkish, 1: English) |
TopicID | Hate domain of the tweet (0: Religion, 1: Gender, 2: Race, 3: Politics, 4: Sports) |
HateLabel | Final hate label decision (0: Normal, 1: Offensive, 2: Hate) |
Distribution of tweets in dataset v2 is as follows:
Lang. | Domain | Hate | Offensive | Normal | Total |
---|---|---|---|---|---|
EN | Religion | 328 | 2,369 | 10,713 | 13,410 |
EN | Gender | 255 | 3,043 | 9,537 | 12,835 |
EN | Race | 405 | 1,631 | 12,566 | 14,602 |
EN | Politics | 343 | 2,972 | 9,994 | 13,309 |
EN | Sports | 286 | 2,814 | 11,341 | 14,441 |
EN | Total | 1,617 (2%) | 12,829 (19%) | 54,151 (79%) | 68,597 |
TR | Religion | 2,281 | 3,814 | 5,058 | 11,153 |
TR | Gender | 970 | 3,385 | 8,353 | 12,708 |
TR | Race | 1,897 | 2,276 | 8,236 | 12,409 |
TR | Politics | 3,657 | 1,529 | 6,251 | 11,437 |
TR | Sports | 4,016 | 3,930 | 4,657 | 12,603 |
TR | Total | 12,821 (21%) | 14,934 (25%) | 32,555 (54%) | 60,310 |
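Since LangID, TopicID, and HateLabel are all included in the released ID file, the distribution above can be reproduced directly. A short sketch with pandas follows; the file name `hate_speech_dataset_v2.csv` is an assumption, so use the actual name of the released file.

```python
# Sketch: rebuild the per-language, per-domain label distribution of
# dataset v2 with pandas. The file name below is an assumption.
import pandas as pd

LANG = {0: "TR", 1: "EN"}
TOPIC = {0: "Religion", 1: "Gender", 2: "Race", 3: "Politics", 4: "Sports"}
LABEL = {0: "Normal", 1: "Offensive", 2: "Hate"}

df = pd.read_csv("hate_speech_dataset_v2.csv")
table = pd.crosstab(
    [df["LangID"].map(LANG), df["TopicID"].map(TOPIC)],  # rows: language, domain
    df["HateLabel"].map(LABEL),                          # columns: hate label
    margins=True,                                        # add row/column totals
)
print(table)
```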
Dataset v2 labeler file (hate_speech_dataset_v2_labeler.csv): This file contains the individual annotations for each tweet. There are 20 labelers in total, and each tweet is annotated by 5 of them.
Column Name | Description |
---|---|
TweetID | Twitter ID of the tweet |
labeler_i | Annotation of the i-th labeler (0: Normal, 1: Offensive, 2: Hate) |
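A short sketch of how the per-tweet majority label and agreement level can be computed from this file. The exact column layout (e.g. empty cells for the labelers who did not annotate a given tweet) is an assumption based on the description above.

```python
# Sketch: majority label and agreement ratio per tweet from the
# individual annotations. Assumes 20 labeler_* columns, of which 5 are
# filled for each tweet and the rest are empty.
import pandas as pd

ann = pd.read_csv("hate_speech_dataset_v2_labeler.csv")
labeler_cols = [c for c in ann.columns if c.startswith("labeler")]

def majority_and_agreement(row):
    votes = row[labeler_cols].dropna().astype(int)   # the 5 given annotations
    counts = votes.value_counts()
    majority = counts.idxmax()                       # most frequent label
    agreement = counts.max() / len(votes)            # e.g. 4 of 5 -> 0.8
    return pd.Series({"majority": majority, "agreement": agreement})

ann[["majority", "agreement"]] = ann.apply(majority_and_agreement, axis=1)
# Dataset v2 keeps only tweets with more than 80% annotator agreement.
print(ann["agreement"].value_counts())
```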
Using dataset v2, we run BERT and BERTurk with 10-fold cross-validation (as done for the published version, v1). Each split uses 90% of the data for training and 10% for testing. We report the average F1 scores.
F1-Score | Neutral | Offensive | Hateful | Weighted |
---|---|---|---|---|
Bert-base-uncased (EN) | 0.968 ± 0.002 | 0.858 ± 0.008 | 0.631 ± 0.039 | 0.940 ± 0.004 |
Bert-base-turkish-uncased (TR) | 0.946 ± 0.002 | 0.852 ± 0.005 | 0.887 ± 0.005 | 0.910 ± 0.003 |
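For reference, a lightweight sketch of this evaluation protocol is given below: 10 stratified folds with per-class and weighted F1 averaged over folds. A TF-IDF plus logistic regression baseline stands in for BERT/BERTurk fine-tuning purely to keep the example small; the scores reported above come from the transformer models.

```python
# Sketch: 10-fold cross-validation with per-class and weighted F1,
# averaged over folds. The classifier here is a simple stand-in, not
# the BERT/BERTurk setup used for the reported scores.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def cross_validate(texts, labels, n_splits=10, seed=42):
    texts, labels = np.array(texts), np.array(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    per_class, weighted = [], []
    for train_idx, test_idx in skf.split(texts, labels):
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(texts[train_idx], labels[train_idx])
        preds = clf.predict(texts[test_idx])
        per_class.append(f1_score(labels[test_idx], preds, average=None))
        weighted.append(f1_score(labels[test_idx], preds, average="weighted"))
    # mean and std over the 10 folds, per class and weighted
    return (np.mean(per_class, axis=0), np.std(per_class, axis=0),
            np.mean(weighted), np.std(weighted))
```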
Thanks to Izzet Emre Kucukkaya for his help in preparing dataset v2.
Dataset v1 is composed of 200,000 tweets: half of them are Turkish and the other half are English. We also provide the hate domain of each tweet. The domains are Religion, Gender, Race, Politics, and Sports, and each domain has 20,000 tweets per language. The 5 hate annotations of each tweet are also given. Since we follow Twitter's Terms and Conditions, we publish tweet IDs rather than the tweet content directly. Explanations of the columns of the file are as follows:
Column Name | Description |
---|---|
TweetID | Twitter ID of the tweet |
LangID | Language of the tweet (0: Turkish, 1: English) |
TopicID | Hate domain of the tweet (0: Religion, 1: Gender, 2: Race, 3: Politics, 4: Sports) |
Label_1 | Annotation of the first annotator (0: Normal, 1: Offensive, 2: Hate) |
Label_2 | Annotation of the second annotator (0: Normal, 1: Offensive, 2: Hate) |
Label_3 | Annotation of the third annotator (0: Normal, 1: Offensive, 2: Hate) |
Label_4 | Annotation of the fourth annotator (0: Normal, 1: Offensive, 2: Hate) |
Label_5 | Annotation of the fifth annotator (0: Normal, 1: Offensive, 2: Hate) |
HateLabel | Final hate label decision (0: Normal, 1: Offensive, 2: Hate) |
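Since only tweet IDs are shared, the tweet text has to be fetched ("hydrated") from the Twitter/X API, which, as noted at the top of this README, now has limited free access. A minimal sketch with tweepy is below; the bearer token is a placeholder, and deleted or protected tweets will not be returned.

```python
# Sketch: hydrate tweet texts from the released tweet IDs with tweepy.
# Requires Twitter/X API access; "YOUR_BEARER_TOKEN" is a placeholder.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

def hydrate(tweet_ids):
    texts = {}
    # The tweet lookup endpoint accepts at most 100 IDs per request.
    for i in range(0, len(tweet_ids), 100):
        resp = client.get_tweets(ids=tweet_ids[i:i + 100], tweet_fields=["lang"])
        for tweet in resp.data or []:   # deleted/protected tweets are simply missing
            texts[tweet.id] = tweet.text
    return texts
```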
Distribution of tweets in dataset v1 is as follows:
Lang. | Domain | Hate | Offensive | Normal | Total |
---|---|---|---|---|---|
EN | Religion | 1,427 | 5,221 | 13,352 | 20,000 |
EN | Gender | 1,313 | 6,431 | 12,256 | 20,000 |
EN | Race | 1,541 | 3,846 | 14,613 | 20,000 |
EN | Politics | 1,610 | 6,018 | 12,372 | 20,000 |
EN | Sports | 1,434 | 5,624 | 12,942 | 20,000 |
EN | Total | 7,325 (7%) | 27,140 (27%) | 65,535 (66%) | 100,000 |
TR | Religion | 5,688 | 7,435 | 6,877 | 20,000 |
TR | Gender | 2,780 | 6,521 | 10,699 | 20,000 |
TR | Race | 5,095 | 4,905 | 10,000 | 20,000 |
TR | Politics | 7,657 | 4,253 | 8,090 | 20,000 |
TR | Sports | 6,373 | 7,633 | 5,994 | 20,000 |
TR | Total | 27,593 (28%) | 30,747 (31%) | 41,660 (41%) | 100,000 |
Please contact Cagri Toraman (cagritoraman@gmail.com) in case of any issues with the datasets.
If you make use of this dataset, please cite the following paper:
@InProceedings{toraman2022large,
author = {Toraman, Cagri and \c{S}ahinu\c{c}, Furkan and Yilmaz, Eyup Halit},
title = {Large-Scale Hate Speech Detection with Cross-Domain Transfer},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {2215--2225},
url = {https://aclanthology.org/2022.lrec-1.238}
}