Turkish and English Dataset from "Large-Scale Hate Speech Detection with Cross-Domain Transfer"

avaapm/hatespeech

Update on April 27th, 2023: We acknowledge that the Twitter API has limited free access. Please contact Cagri Toraman (cagritoraman@gmail.com) if you have difficulty fetching the data from the Twitter API.

Published Models:

English hate speech detection model fine-tuned on Dataset v2:

https://huggingface.co/ctoraman/hate-speech-bert

Turkish hate speech detection model fine-tuned on Dataset v2:

https://huggingface.co/ctoraman/hate-speech-berturk
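The two models above can probably be loaded with the Hugging Face `transformers` library. The following is a minimal sketch, not an official usage example: it assumes `transformers` and a backend such as PyTorch are installed, and `load_classifier` is a hypothetical helper name. The label names the models return are not documented in this README, so check the model cards before interpreting the output.

```python
# Model IDs taken from the Hugging Face links in this README.
MODEL_IDS = {
    "en": "ctoraman/hate-speech-bert",
    "tr": "ctoraman/hate-speech-berturk",
}


def load_classifier(lang):
    """Return a text-classification pipeline for 'en' or 'tr' (hypothetical helper)."""
    # Imported lazily so this module can be read and tested without transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_IDS[lang])


# Example (downloads the model on first use):
# classifier = load_classifier("en")
# print(classifier("I really enjoyed the match today."))
```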

Large-Scale Hate Speech Datasets

This repository contains the datasets used in the LREC 2022 paper "Large-Scale Hate Speech Detection with Cross-Domain Transfer". The study focuses on hate speech detection in Turkish and English. It also examines how well detection transfers between hate domains.

There are two dataset versions.

Dataset v1: The original dataset, published in LREC 2022, with 100,000 tweets each for English and Turkish. Annotations with more than 60% agreement are included.

Dataset v2: A more reliable version with 68,597 tweets for English and 60,310 for Turkish. Only annotations with more than 80% agreement are included.

Dataset v2 (hate_speech_dataset_v2.csv)

We acknowledge that some annotations in the original dataset (v1) are controversial. Therefore, we publish a more reliable version (v2) that includes only the tweets with more than 80% annotator agreement. Dataset v2 has 128,907 tweets: 60,310 Turkish and 68,597 English. The columns of the file are as follows:

| Column Name | Description |
| --- | --- |
| TweetID | Twitter ID of the tweet |
| LangID | Language of the tweet: 0-Turkish, 1-English |
| TopicID | Domain of the topic: 0-Religion, 1-Gender, 2-Race, 3-Politics, 4-Sports |
| HateLabel | Final hate label decision: 0-Normal, 1-Offensive, 2-Hate |
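The integer codes in the table above can be mapped back to readable names when reading the file. The sketch below uses only the Python standard library and assumes the CSV is comma-separated with a header row whose names match the column names listed above; the two sample rows and their tweet IDs are made up for illustration.

```python
import csv
import io

# Code-to-name mappings copied from the column descriptions above.
LANGUAGES = {0: "Turkish", 1: "English"}
TOPICS = {0: "Religion", 1: "Gender", 2: "Race", 3: "Politics", 4: "Sports"}
HATE_LABELS = {0: "Normal", 1: "Offensive", 2: "Hate"}


def decode_rows(csv_file):
    """Yield one dict per tweet with the integer codes replaced by names."""
    for row in csv.DictReader(csv_file):
        yield {
            # Keep IDs as strings: tweet IDs can exceed what a float stores exactly.
            "TweetID": row["TweetID"],
            "Language": LANGUAGES[int(row["LangID"])],
            "Topic": TOPICS[int(row["TopicID"])],
            "HateLabel": HATE_LABELS[int(row["HateLabel"])],
        }


# Hypothetical two-row sample; the tweet IDs are invented.
sample = io.StringIO(
    "TweetID,LangID,TopicID,HateLabel\n"
    "1000000000000000001,1,4,0\n"
    "1000000000000000002,0,2,2\n"
)
decoded = list(decode_rows(sample))
print(decoded[0])
```

With the real file, the in-memory sample would be replaced by something like `open("hate_speech_dataset_v2.csv", newline="", encoding="utf-8")`, assuming that encoding.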

The distribution of tweets in the dataset is as follows:

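| Lang. | Domain | Hate | Offensive | Normal | Total |
| --- | --- | --- | --- | --- | --- |
| EN | Religion | 328 | 2,369 | 10,713 | 13,410 |
| EN | Gender | 255 | 3,043 | 9,537 | 12,835 |
| EN | Race | 405 | 1,631 | 12,566 | 14,602 |
| EN | Politics | 343 | 2,972 | 9,994 | 13,309 |
| EN | Sports | 286 | 2,814 | 11,341 | 14,441 |
| EN | Total | 1,617 (2%) | 12,829 (19%) | 54,151 (79%) | 68,597 |
| TR | Religion | 2,281 | 3,814 | 5,058 | 11,153 |
| TR | Gender | 970 | 3,385 | 8,353 | 12,708 |
| TR | Race | 1,897 | 2,276 | 8,236 | 12,409 |
| TR | Politics | 3,657 | 1,529 | 6,251 | 11,437 |
| TR | Sports | 4,016 | 3,930 | 4,657 | 12,603 |
| TR | Total | 12,821 (21%) | 14,934 (25%) | 32,555 (54%) | 60,310 |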
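A distribution like the one above can be recomputed from the CSV by counting rows per language, domain, and label. This is a minimal sketch under the same assumptions as the column table (comma-separated file, header names `LangID`, `TopicID`, `HateLabel`); the function names and the five sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file):
    """Count tweets per (LangID, TopicID, HateLabel) combination."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        key = (int(row["LangID"]), int(row["TopicID"]), int(row["HateLabel"]))
        counts[key] += 1
    return counts


def class_shares(counts, lang_id):
    """Return the fraction of each HateLabel within one language."""
    per_label = Counter()
    for (lang, _topic, label), n in counts.items():
        if lang == lang_id:
            per_label[label] += n
    total = sum(per_label.values())
    return {label: n / total for label, n in per_label.items()}


# Hypothetical sample: four English rows and one Turkish row.
sample = io.StringIO(
    "TweetID,LangID,TopicID,HateLabel\n"
    "1,1,0,0\n"
    "2,1,0,0\n"
    "3,1,4,1\n"
    "4,1,2,2\n"
    "5,0,3,2\n"
)
counts = label_distribution(sample)
english_shares = class_shares(counts, lang_id=1)
print(english_shares)
```

The strong class imbalance in the English half (about 2% Hate) is worth keeping in mind when choosing evaluation metrics.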

Dataset labeler (hate_speech_dataset_v2_labeler.csv)

This file contains the individual annotations for each tweet. There are 20 labelers, and each tweet is annotated by 5 labelers.

| Column Name | Description |
| --- | --- |
| TweetID | Twitter ID of the tweet |
| labeler_i | Annotation of the i-th annotator: 0-Normal, 1-Offensive, 2-Hate |
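The individual annotations make it possible to compute a per-tweet agreement score. This README does not state how the authors measure agreement, so the sketch below uses one simple definition as an assumption: the share of annotators who chose the most common label. The function name is hypothetical, and empty cells are skipped on the assumption that a labeler who did not annotate a tweet leaves no value.

```python
from collections import Counter


def majority_and_agreement(labels):
    """Return (most common label, share of annotators who chose it).

    `labels` holds the annotations for one tweet; empty or missing
    cells are skipped. On a tie, the label seen first wins.
    """
    votes = [int(x) for x in labels if x not in ("", None)]
    if not votes:
        raise ValueError("no annotations for this tweet")
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Four of five annotators chose 2-Hate.
print(majority_and_agreement(["2", "2", "2", "1", "2"]))
# All five annotators chose 0-Normal; one extra cell is empty.
print(majority_and_agreement(["0", "0", "", "0", "0", "0"]))
```

A threshold such as "more than 80%" can then be applied to the second element of the returned tuple.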

Using dataset v2, we run BERT and BERTurk with 10-fold cross-validation (as in the published version, v1). Each fold uses 90% of the data for training and 10% for testing. We report the average F1 scores.
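The 90%/10% split per fold follows directly from dividing the data into 10 parts. The sketch below generates such splits with the standard library only. It is an illustration of plain 10-fold cross-validation, not the authors' exact procedure: whether they shuffled, stratified by label, or used a particular seed is not stated here, and `k_fold_splits` and its `seed` parameter are hypothetical.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With the 68,597 English tweets, each test fold holds 6,859 or 6,860 tweets.
for train, test in k_fold_splits(68597):
    print(len(train), len(test))
```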

| F1-Score | Neutral | Offensive | Hateful | Weighted |
| --- | --- | --- | --- | --- |
| Bert-base-uncased (EN) | 0.968 ± 0.002 | 0.858 ± 0.008 | 0.631 ± 0.039 | 0.940 ± 0.004 |
| Bert-base-turkish-uncased (TR) | 0.946 ± 0.002 | 0.852 ± 0.005 | 0.887 ± 0.005 | 0.910 ± 0.003 |
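As a rough sanity check, the Weighted column can be approximated from the per-class scores and the class counts of dataset v2. The sketch assumes that "Weighted" means a support-weighted average of the per-class F1 scores, and that the Neutral and Hateful columns correspond to the Normal and Hate labels; neither is stated explicitly here. Because the reported scores are averaged over folds, only an approximate match is expected.

```python
def weighted_f1(f1_by_class, support_by_class):
    """Average per-class F1 scores, weighting each class by its tweet count."""
    total = sum(support_by_class.values())
    return sum(f1_by_class[c] * support_by_class[c] for c in f1_by_class) / total


# Class counts from the dataset v2 distribution table.
EN_SUPPORT = {"Normal": 54151, "Offensive": 12829, "Hate": 1617}
TR_SUPPORT = {"Normal": 32555, "Offensive": 14934, "Hate": 12821}

# Mean per-class F1 scores from the results table above.
EN_F1 = {"Normal": 0.968, "Offensive": 0.858, "Hate": 0.631}
TR_F1 = {"Normal": 0.946, "Offensive": 0.852, "Hate": 0.887}

print(round(weighted_f1(EN_F1, EN_SUPPORT), 3))  # reported: 0.940
print(round(weighted_f1(TR_F1, TR_SUPPORT), 3))  # reported: 0.910
```

The low share of Hate tweets in English explains why the weighted score stays high even though the Hateful F1 is much lower than the other two.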

Thanks to Izzet Emre Kucukkaya for helping in the preparation of the dataset v2.

Dataset v1 (hate_speech_dataset.csv)

The dataset is composed of 200,000 tweets. Half of them are Turkish and the other half are English. We also provide the hate speech domain of each tweet. The domains are Religion, Gender, Race, Politics, and Sports, and each domain has 20,000 tweets per language. The 5 individual hate annotations of each tweet are also given. In accordance with Twitter's Terms and Conditions, we publish tweet IDs rather than the tweet content. The columns of the file are as follows:

| Column Name | Description |
| --- | --- |
| TweetID | Twitter ID of the tweet |
| LangID | Language of the tweet: 0-Turkish, 1-English |
| TopicID | Domain of the topic: 0-Religion, 1-Gender, 2-Race, 3-Politics, 4-Sports |
| Label_1 | Annotation of the first annotator: 0-Normal, 1-Offensive, 2-Hate |
| Label_2 | Annotation of the second annotator: 0-Normal, 1-Offensive, 2-Hate |
| Label_3 | Annotation of the third annotator: 0-Normal, 1-Offensive, 2-Hate |
| Label_4 | Annotation of the fourth annotator: 0-Normal, 1-Offensive, 2-Hate |
| Label_5 | Annotation of the fifth annotator: 0-Normal, 1-Offensive, 2-Hate |
| HateLabel | Final hate label decision: 0-Normal, 1-Offensive, 2-Hate |
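Because only tweet IDs are published, the tweet text has to be fetched separately. The sketch below only prepares batched lookup URLs and makes no network call. It assumes the Twitter API v2 tweet lookup endpoint at `https://api.twitter.com/2/tweets` with a limit of 100 IDs per request; endpoint, limits, and access terms may have changed, so treat these values as assumptions and check the current API documentation. The helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed v2 tweet lookup endpoint and per-request ID limit; verify before use.
TWEETS_ENDPOINT = "https://api.twitter.com/2/tweets"
MAX_IDS_PER_REQUEST = 100


def batch_ids(tweet_ids, size=MAX_IDS_PER_REQUEST):
    """Yield consecutive slices of at most `size` tweet IDs."""
    for start in range(0, len(tweet_ids), size):
        yield tweet_ids[start:start + size]


def lookup_url(id_batch):
    """Build the lookup URL for one batch of tweet IDs (given as strings)."""
    return TWEETS_ENDPOINT + "?" + urlencode({"ids": ",".join(id_batch)})


# Hypothetical IDs standing in for the TweetID column.
tweet_ids = [str(1000000000000000000 + i) for i in range(250)]
urls = [lookup_url(batch) for batch in batch_ids(tweet_ids)]
print(len(urls))
```

Each request would also need an authorization header carrying your own API token. Tweets that have been deleted or made private since annotation will likely be missing from the responses.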

The distribution of tweets in the dataset is as follows:

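| Lang. | Domain | Hate | Offensive | Normal | Total |
| --- | --- | --- | --- | --- | --- |
| EN | Religion | 1,427 | 5,221 | 13,352 | 20,000 |
| EN | Gender | 1,313 | 6,431 | 12,256 | 20,000 |
| EN | Race | 1,541 | 3,846 | 14,613 | 20,000 |
| EN | Politics | 1,610 | 6,018 | 12,372 | 20,000 |
| EN | Sports | 1,434 | 5,624 | 12,942 | 20,000 |
| EN | Total | 7,325 (7%) | 27,140 (27%) | 65,535 (66%) | 100,000 |
| TR | Religion | 5,688 | 7,435 | 6,877 | 20,000 |
| TR | Gender | 2,780 | 6,521 | 10,699 | 20,000 |
| TR | Race | 5,095 | 4,905 | 10,000 | 20,000 |
| TR | Politics | 7,657 | 4,253 | 8,090 | 20,000 |
| TR | Sports | 6,373 | 7,633 | 5,994 | 20,000 |
| TR | Total | 27,593 (28%) | 30,747 (31%) | 41,660 (41%) | 100,000 |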

Contact

Please contact Cagri Toraman (cagritoraman@gmail.com) in case of any issues with the datasets.

Citation

If you make use of this dataset, please cite the following paper.

@InProceedings{toraman2022large,
  author    = {Toraman, Cagri  and  \c{S}ahinu\c{c}, Furkan and Yilmaz, Eyup Halit},
  title     = {Large-Scale Hate Speech Detection with Cross-Domain Transfer},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {2215--2225},
  url       = {https://aclanthology.org/2022.lrec-1.238}
}
