This repository lists datasets that have been used for hate speech detection and related tasks, such as cyberbullying, abusive language, and online harassment detection, to make them easier for researchers to find and obtain.
Although data can be collected from several social media platforms, building a balanced, labeled dataset is costly in time and effort and remains a problem for researchers in the area. Most of the datasets listed below are not openly available, but some of them can be obtained from the authors on request.
No | Datasets (Link to paper) | Objects | Size | Available | Labels |
---|---|---|---|---|---|
1 | IberEval 2018 | Tweets | 4138 | Download | Misogyny (5 categories), Not Misogyny |
2 | MEX-A3T | Tweets | 11000 | Download | Aggressive, Not Aggressive |
3 | SemEval19, 2019 | Tweets | 4500 | Request Link | Hate Speech, Non Hate Speech |
4 | Pereira et al., 2019 | Tweets | 6000 | Download | Hate Speech, Non Hate Speech |
5 | Chilean Dataset | Tweets | 9834 | Download | Several Categories including hate speech |
No | Datasets (Link to paper) | Objects | Size | Available | Labels |
---|---|---|---|---|---|
1 | Sanguinetti et al., 2018 | Tweets | 6929 | Download | Hate Speech, Non Hate Speech |
2 | EVALITA 2018 | Facebook Posts | 4000 | Download | No Hate, Weak Hate, Strong Hate |
3 | EVALITA 2018 | Tweets | 4000 | Download | Hate Speech, Non Hate Speech |
4 | EVALITA 2020 | Tweets | 6839 | Request Link | Hate Speech, Non Hate Speech |
No | Datasets (Link to paper) | Objects | Size | Available | Labels | Comment |
---|---|---|---|---|---|---|
1 | Dinakar et al., 2011 | YouTube Comments | 6000 | - | Sexuality, Race, Culture, Intelligence | |
2 | Dadvar and Jong, 2012 | Myspace Posts | 2200 | - | Bullying, Non Bullying | |
3 | Huang et al., 2014 | Tweets | 4865 | - | Bullying, Non Bullying | |
4 | Hosseinmardi et al., 2015 | Instagram Media Sessions | 998 | - | Bullying, Non Bullying | |
5 | Waseem and Hovy, 2016 | Tweets | 16914 | Download | Racist, Sexist, Neither | |
6 | Waseem, 2016 | Tweets | 6909 | Download | Racist, Sexist, Neither, Both | |
7 | Nobata et al., 2016 | Yahoo Comments | 2000 | - | Abusive, Clean | |
8 | Chatzakou et al., 2017 | Twitter Users | 9484 | - | Aggressor, Bully, Spammer | |
9 | Davidson et al., 2017 | Tweets | 24802 | Download | hate_speech, offensive, neither | |
10 | Golbeck et al., 2017 | Tweets | 35000 | - | Harassing, Non Harassing | |
11 | Wulczyn et al., 2017 | Wikipedia Comments | 100000 | Download | Personal Attacks | |
12 | Tahmasbi and Rastegari, 2018 | Tweets | 12837 | - | Bullying, Non Bullying | |
13 | Anzovino et al., 2018 | Tweets | 4454 | - | Discredit, Stereotype, Objectification, Sexual_Harassment, Threats of Violence, Dominance, Derailing | |
14 | Founta et al., 2018 | Tweets | 80000 | Download | Hate Speech, Offensive, None | |
15 | Gibert et al., 2018 | Sentences from Stormfront | 10568 | Download | Hate Speech, Non Hate Speech | |
16 | SemEval19, 2019 | Tweets | 9000 | Request Link | Hate Speech, Non Hate Speech | |
17 | OLID 2019 | Tweets | 14100 | Download | Offensive, Non Offensive | |
18 | TRAC-2 2020 | Messages (Twitter, Facebook, YouTube) | 4,263 | Request Form | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) | Data geolocated in India |
19 | meTooMA 2020 | Tweets | 9,973 | Download | Hate Speech (Directed, Generalized), Relevance (0, 1), Stance (Support, Opposition, Neither) | Data geolocated in India, Australia, Kenya, Iran, UK |
No | Datasets (Link to paper) | Objects | Size | Available | Labels |
---|---|---|---|---|---|
1 | Mubarak et al., 2017 | Tweets | 1100 | Download | Obscene, Offensive but not obscene, Clean |
2 | Albadi et al., 2018 | Tweets | 6136 | Download | Hate Speech, Non Hate Speech |
3 | Alakrot A. et al., 2018 | Tweets | 15050 | Download | Offensive, Not Offensive |
4 | Ousidhoum et al., 2019 | Tweets | 3353 | Download | Hate Speech, Non Hate Speech |
5 | L-HSAB, 2019 | Tweets | 5846 | Download | Normal, Abuse, Hate Speech |
No | Datasets (Link to paper) | Objects | Size | Available | Language | Labels |
---|---|---|---|---|---|---|
1 | Hee et al., 2015 | Ask.fm Posts | 85485 | - | Dutch | Threat-Blackmail, Sexual-talk, Insult, Curse-Exclusion, Defense, Defamation-Encouragement |
2 | Papegnies et al., 2017 | Game Chat Logs | 2779 | - | French | Abusive, Non Abusive |
3 | Sirihattasak et al., 2018 | Tweets | 3,300 | Yes | Thai | Toxic, Non Toxic |
4 | Bohra et al., 2018 | Tweets | 4575 | Yes | Hindi-English | Hate Speech, Non Hate Speech |
5 | Fortuna et al., 2019 | Tweets | 5668 | Download | Portuguese | Hate Speech (81 categories), Non Hate Speech |
6 | TRAC-2 2020 | Messages (Twitter, Facebook, YouTube) | 3,984 | Request Form | Hindi | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) |
8 | TRAC-2 2020 | Messages (Twitter, Facebook, YouTube) | 3,826 | Request Form | Bangla | Misogynous (GEN, NGEN), Aggression Level (OAG, CAG, NAG) |
No | Datasets (Link to paper) | Objects | Size | Available | Language | Labels |
---|---|---|---|---|---|---|
1 | XHate-999 | Tweets from previously published English datasets, translated into 5 languages | 600 (x 6 languages) | Download | English, German, Russian, Croatian, Albanian, Turkish | Sexism, Racism, Toxicity, Hatefulness, Aggression, Attack, Cyberbullying, Misogyny, Obscenity, Threats, Insults |
No | Datasets (Link to paper) | Objects | Size | Available | Language | Labels |
---|---|---|---|---|---|---|
1 | Kiela et al., 2020 | Memes (Image + Text) | 10000 | Competition link | Texts in English | Hate, No Hate |
2 | Pramanick et al., 2021 | Memes (Image + Text) | 3544 | Download | Texts in English | somewhat harmful, not harmful, very harmful |
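
Since class balance is a recurring concern with the datasets listed above, it can be useful to check the label distribution of a file once it has been obtained. The snippet below is a minimal sketch using only the Python standard library. It assumes the dataset is a CSV file with a header row and a label column named `label`; both the column name and the sample rows are hypothetical, and each dataset in the tables uses its own file format and label names.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file, label_column="label"):
    """Return {label: share of rows} for a CSV file object with a header row.

    `label_column` is an assumed column name; adjust it to the dataset at hand.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter(row[label_column] for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Tiny in-memory stand-in for a downloaded file; the rows are made up.
sample = io.StringIO(
    "text,label\n"
    "example one,Hate Speech\n"
    "example two,Non Hate Speech\n"
    "example three,Non Hate Speech\n"
    "example four,Non Hate Speech\n"
)

distribution = label_distribution(sample)
for label, share in sorted(distribution.items()):
    print(f"{label}: {share:.0%}")
```

For a file on disk, the same function can be called with an opened file, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded dataset.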