Skip to content

dyasutomo/ToxCom

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ToxCom

This is a project for the Erdos pilot-NLP bootcamp that occured over the course of March-April 2021.

You can see our final presentation here

Team Members

Ghanashyam Khanal Dyas Utomo

Project Motivation and Introduction

Abuse and harassment online means many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

Our goal is to predict negative comments and label those comments into 6 categories: toxic, severe toxic, obscene, threat, insult, and identity hate. So, this is a multi-label classification problem.

We obtained 159,511 comments from Wikipedia’s talk page edits[1] with length 1 to 1250 tokens per comment and 167,897 unique tokens (outside stopwords, etc.). Data downloaded from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

Exploratory data analysis [EDA]

  • Analyzed the Wordcloud for only Toxic comments vs all comments
  • Bar chart of most used words for the toxic comments
  • Calculated the weighted average of number of words using apostrophe per comment to differentiate between the toxic vs non-toxic comments.

Dealing with Data Imbalance

  • The data set was an extreme case of class imbalance: ~90% data was non-toxic.
  • We focused on the minory class when modeling (via loss_weight and scoring params)
  • Analyze Precision and Recall as opposed to acccuracy due to the imbalanced data.

Feature Processing

  • Processed the text data by removing \n, apostrophe, stopwords and stemming.
  • Performed Tokenization of clean data to prepare text-to-matrix, TF-IDF and Word2Vec matrices.

Modeling

  • We used multiclass classifier with different models.
  • Performed the binary (toxic vs non-toxic) as well as multiclass models.
  • Used model: Logistic Regression, Decision Tree, Neural Network (with Dense layer)
  • Updated the model with complex parameter tuning.

Result

  • So far complex Logistic Regression performed best for the binary case.
  • Best multiclass recall score was 63%. and best precision score was 75%.

Deployment to the web

An example of non-toxic comment on the webapp. Non-toxic comment

An example of intermediate comment on the webapp. Intermediate comment

An example of Toxic comment on the webapp. Toxic comment

About

Erdos NLP project

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published