Datasets from Related Literature
In this repository, we present information on datasets that have been used for hate speech detection or related concepts such as cyberbullying, abusive language, and online harassment, among others, to make it easier for researchers to obtain them.
Although there are several social media platforms to collect data from, building a balanced labeled dataset is costly in time and effort, and it remains a problem for researchers in the area. Most of the datasets listed below are not explicitly available, but some of them can be obtained from the authors on request.

| No | Datasets (Link to paper) | Objects | Size | Available | Labels | Comment |
|---|---|---|---|---|---|---|
| 1 | Dinakar et al., 2011 | YouTube Comments | 6000 | - | Sexuality, Race, Culture, Intelligence | |
| 2 | Dadvar and Jong, 2012 | Myspace Posts | 2200 | - | Bullying, Non Bullying | |
| 3 | Huang et al., 2014 | Tweets | 4865 | - | Bullying, Non Bullying | |
| 4 | Hosseinmardi et al., 2015 | Instagram Media Sessions | 998 | - | Bullying, Non Bullying | |
| 5 | Waseem and Hovy, 2016 | Tweets | 16914 | Download | Racist, Sexist, Neither | |
| 6 | Waseem, 2016 | Tweets | 6909 | Download | Racist, Sexist, Neither, Both | |
| 7 | Nobata et al., 2016 | Yahoo Comments | 2000 | - | Abusive, Clean | |
| 8 | Chatzakou et al., 2017 | Twitter Users | 9484 | - | Aggressor, Bully, Spammer | |
| 9 | Davidson et al., 2017 | Tweets | 24802 | Download | hate_speech, offensive, neither | |
| 10 | Golbeck et al., 2017 | Tweets | 35000 | - | Harassing, Non Harassing | |
| 11 | Wulczyn et al., 2017 | Wikipedia Comments | 100000 | Download | Personal Attacks | |
| 12 | Tahmasbi and Rastegari, 2018 | Tweets | 12837 | - | Bullying, Non Bullying | |
| 13 | Anzovino et al., 2018 | Tweets | 4454 | - | Discredit, Stereotype, Objectification, Sexual_Harassment, Threats of Violence, Dominance, Derailing | |
| 14 | Founta et al., 2018 | Tweets | 80000 | Download | Hate Speech, Offensive, None | |
| 15 | Gibert et al., 2018 | Sentences from Stormfront | 10568 | Download | Hate Speech, Non Hate Speech | |
| 16 | SemEval19, 2019 | Tweets | 9000 | Request Link | Hate Speech, Non Hate Speech | |
| 17 | OLID 2019 | Tweets | 14100 | Download | Offensive, Non Offensive | |
| 18 | TREC2 2020 | Messages (Twitter, Facebook, YouTube) | 4263 | Request Form | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) | Data geolocated in India |
| 19 | meTooMA 2020 | Tweets | 9973 | Download | Hate Speech (Directed, Generalized), Relevance (0, 1), Stance (Support, Opposition, Neither) | Data geolocated in India, Australia, Kenya, Iran, UK |

| No | Datasets (Link to paper) | Objects | Size | Available | Labels |
|---|---|---|---|---|---|
| 1 | Mubarak et al., 2017 | Tweets | 1100 | Download | Obscene, Offensive but not obscene, Clean |
| 2 | Albadi et al., 2018 | Tweets | 6136 | Download | Hate Speech, Non Hate Speech |
| 3 | Alakrot A. et al., 2018 | Tweets | 15050 | Download | Offensive, Not Offensive |
| 4 | Ousidhoum et al., 2019 | Tweets | 3353 | Download | Hate Speech, Non Hate Speech |
| 5 | L-HSAB, 2019 | Tweets | 5846 | Download | Normal, Abuse, Hate Speech |

| No | Datasets (Link to paper) | Objects | Size | Available | Language | Labels |
|---|---|---|---|---|---|---|
| 1 | Hee et al., 2015 | Ask.fm Posts | 85485 | - | Dutch | Threat-Blackmail, Sexual-talk, Insult, Curse-Exclusion, Defense, Defamation-Encouragement |
| 2 | Papegnies et al., 2017 | Game Chat Logs | 2779 | - | French | Abusive, Non Abusive |
| 3 | Sirihattasak et al., 2018 | Tweets | 3300 | Yes | Thai | Toxic, Non Toxic |
| 4 | Bohra et al., 2018 | Tweets | 4575 | Yes | Hindi-English | Hate Speech, Non Hate Speech |
| 5 | Fortuna et al., 2019 | Tweets | 5668 | Download | Portuguese | Hate Speech (81 categories), Non Hate Speech |
| 6 | TREC2 2020 | Messages (Twitter, Facebook, YouTube) | 3984 | Request Form | Hindi | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) |
| 8 | TREC2 2020 | Messages (Twitter, Facebook, YouTube) | 3826 | Request Form | Bangla | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) |

Multilingual (Parallel Data)

| No | Datasets (Link to paper) | Objects | Size | Available | Language | Labels |
|---|---|---|---|---|---|---|
| 1 | XHate 999 | Tweets from previously published English datasets, translated into 5 languages | 600 (x 6 languages) | Download | English, German, Russian, Croatian, Albanian, Turkish | Sexism, racism, toxicity, hatefulness, aggression, attack, cyberbullying, misogyny, obscenity, threats, insults |
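
The introduction notes that building a balanced labeled dataset is costly. As a rough illustration of what "balanced" means in practice, the sketch below counts labels and reports the ratio between the largest and smallest class. The helper names `label_distribution` and `imbalance_ratio` are hypothetical, and the toy labels are made up for the example (they only borrow the label names from the Davidson et al., 2017 row); they are not statistics from any dataset listed here.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, share)} for a list of label strings."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy labels, NOT taken from any dataset listed in this repository.
toy_labels = ["offensive"] * 6 + ["neither"] * 3 + ["hate_speech"] * 1

dist = label_distribution(toy_labels)
ratio = imbalance_ratio(toy_labels)

for label, (n, share) in dist.items():
    print(f"{label}: {n} ({share:.0%})")
print(f"imbalance ratio: {ratio:.1f}")
```

A ratio close to 1 suggests roughly balanced classes; a large ratio suggests that resampling or class weighting may be worth considering before training a classifier.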