Datasets from Related Literature

This repository collects information on datasets that have been used for hate speech detection and related tasks, such as cyberbullying, abusive language, and online harassment detection, to make them easier for researchers to find.

Although data can be collected from several social media platforms, building a balanced labeled dataset is costly in time and effort, and it remains an open problem for researchers in the area. Most of the datasets listed below are not publicly available, but some of them can be obtained from the authors on request.
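As a rough illustration of what "balanced" means here, the sketch below computes the share of each label and the ratio between the most and least frequent labels. The function names `label_distribution` and `imbalance_ratio` and the toy label list are hypothetical and are not taken from any dataset listed in this repository.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the dataset as a fraction in [0, 1]."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Return the count of the most frequent label divided by the least frequent one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy labels for illustration only.
toy_labels = ["Hate Speech"] * 2 + ["Non Hate Speech"] * 8

print(label_distribution(toy_labels))
print(imbalance_ratio(toy_labels))
```

A ratio close to 1 suggests a roughly balanced dataset; larger values indicate that the minority class may need extra attention during training or evaluation.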

Spanish

No	Datasets (Link to paper)	Objects	Size	Available	Labels
1	IberEval 2018	Tweets	4138	Download	Misogyny (5 categories), Not Misogyny
2	MEX-A3T	Tweets	11000	Download	Aggressive, Not Aggressive
3	SemEval19, 2019	Tweets	4500	Request Link	Hate Speech, Non Hate Speech
4	Pereira et al., 2019	Tweets	6000	Download	Hate Speech, Non Hate Speech
5	Chilean Dataset	Tweets	9834	Download	Several Categories including hate speech

Italian

No	Datasets (Link to paper)	Objects	Size	Available	Labels
1	Sanguinetti et al., 2018	Tweets	6929	Download	Hate Speech, Non Hate Speech
2	EVALITA 2018	Facebook Posts	4000	Download	No Hate, Weak Hate, Strong Hate
3	EVALITA 2018	Tweets	4000	Download	Hate Speech, Non Hate Speech
4	EVALITA 2020	Tweets	6839	Request Link	Hate Speech, Non Hate Speech

English

No	Datasets (Link to paper)	Objects	Size	Available	Labels	Comment
1	Dinakar et al., 2011	YouTube Comments	6000	-	Sexuality, Race, Culture, Intelligence
2	Dadvar and Jong, 2012	Myspace Posts	2200	-	Bullying, Non Bullying
3	Huang et al., 2014	Tweets	4865	-	Bullying, Non Bullying
4	Hosseinmardi et al., 2015	Instagram Media Sessions	998	-	Bullying, Non Bullying
5	Waseem and Hovy, 2016	Tweets	16914	Download	Racist, Sexist, Neither
6	Waseem, 2016	Tweets	6909	Download	Racist, Sexist, Neither, Both
7	Nobata et al., 2016	Yahoo Comments	2000	-	Abusive, Clean
8	Chatzakou et al., 2017	Twitter Users	9484	-	Aggressor, Bully, Spammer
9	Davidson et al., 2017	Tweets	24802	Download	hate_speech, offensive, neither
10	Golbeck et al., 2017	Tweets	35000	-	Harassing, Non Harassing
11	Wulczyn et al., 2017	Wikipedia Comments	100000	Download	Personal Attacks
12	Tahmasbi and Rastegari, 2018	Tweets	12837	-	Bullying, Non Bullying
13	Anzovino et al., 2018	Tweets	4454	-	Discredit, Stereotype, Objectification, Sexual Harassment, Threats of Violence, Dominance, Derailing
14	Founta et al., 2018	Tweets	80000	Download	Hate Speech, Offensive, None
15	Gibert et al., 2018	Sentences from Stormfront	10568	Download	Hate Speech, Non Hate Speech
16	SemEval19, 2019	Tweets	9000	Request Link	Hate Speech, Non Hate Speech
17	OLID 2019	Tweets	14100	Download	Offensive, Non Offensive
18	TREC2 2020	Messages (Twitter, Facebook, YouTube)	4263	Request Form	Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG)	Data geolocated in India
19	meTooMA 2020	Tweets	9973	Download	Hate Speech (Directed, Generalized), Relevance (0, 1), Stance (Support, Opposition, Neither)	Data geolocated in India, Australia, Kenya, Iran, UK

Arabic

No	Datasets (Link to paper)	Objects	Size	Available	Labels
1	Mubarak et al., 2017	Tweets	1100	Download	Obscene, Offensive but not obscene, Clean
2	Albadi et al., 2018	Tweets	6136	Download	Hate Speech, Non Hate Speech
3	Alakrot A. et al., 2018	Tweets	15050	Download	Offensive, Not Offensive
4	Ousidhoum et al., 2019	Tweets	3353	Download	Hate Speech, Non Hate Speech
5	L-HSAB, 2019	Tweets	5846	Download	Normal, Abuse, Hate Speech

Other languages

No	Datasets (Link to paper)	Objects	Size	Available	Language	Labels
1	Hee et al., 2015	Ask.fm Posts	85485	-	Dutch	Threat-Blackmail, Sexual-talk, Insult, Curse-Exclusion, Defense, Defamation-Encouragement
2	Papegnies et al., 2017	Game Chat Logs	2779	-	French	Abusive, Non Abusive
3	Sirihattasak et al., 2018	Tweets	3300	Yes	Thai	Toxic, Non Toxic
4	Bohra et al., 2018	Tweets	4575	Yes	Hindi-English	Hate Speech, Non Hate Speech
5	Fortuna et al., 2019	Tweets	5668	Download	Portuguese	Hate Speech (81 categories), Non Hate Speech
6	TREC2 2020	Messages (Twitter, Facebook, YouTube)	3984	Request Form	Hindi	Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG)
7	TREC2 2020	Messages (Twitter, Facebook, YouTube)	3826	Request Form	Bangla	Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG)

Multilingual (Parallel Data)

No	Datasets (Link to paper)	Objects	Size	Available	Language	Labels
1	XHate 999	Tweets from previously published English datasets, translated into 5 languages	600 (x 6 languages)	Download	English, German, Russian, Croatian, Albanian, Turkish	Sexism, Racism, Toxicity, Hatefulness, Aggression, Attack, Cyberbullying, Misogyny, Obscenity, Threats, Insults

Multimodal Datasets

No	Datasets (Link to paper)	Objects	Size	Available	Language	Labels
1	Kiela et al., 2020	Memes (Image + Text)	10000	Competition link	Texts in English	Hate, No Hate
2	Pramanick et al., 2021	Memes (Image + Text)	3544	Download	Texts in English	Not Harmful, Somewhat Harmful, Very Harmful
