Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #977 from adwityac/adwityac
Add Hate Speech Detection
Showing 10 changed files with 28,764 additions and 0 deletions.
26,954 changes: 26,954 additions & 0 deletions
Hate Speech Detection/Dataset/Dataset---Hate-Speech-Detection-using-Deep-Learning.csv
Large diffs are not rendered by default.
1,565 changes: 1,565 additions & 0 deletions
Hate Speech Detection/Model/Hate_Speech_Detection_using_Deep_Learning.ipynb
Large diffs are not rendered by default.
@@ -0,0 +1,60 @@
## **Hate Speech Detection**

### 🎯 **Goal**

The goal of this project is to develop a deep learning model that identifies and classifies hate speech in text data, so that harmful language can be flagged and filtered, promoting safer and more respectful online interactions.

### 🧵 **Dataset**

The dataset is taken from CrowdFlower: https://data.world/crowdflower/hate-speech-identification

### 🧾 **Description**

This project detects hate speech in text using deep learning. It involves preprocessing the text data, training a neural network model, and evaluating how well the model classifies content as either hate speech or non-hate speech. The model is intended to support online content moderation by identifying harmful language, contributing to safer digital spaces.

### 🧮 **What I have done!**

1. **Data Loading**: Import the labeled text dataset.
2. **Preprocessing**: Clean text by removing noise, tokenizing, and normalizing.
3. **EDA**: Analyze class distribution and visualize data patterns.
4. **Model Building**: Create a neural network with embedding and LSTM layers.
5. **Training**: Train the model with a split of training and validation data.
6. **Evaluation**: Assess performance using metrics like accuracy and F1-score.
7. **Visualization**: Plot accuracy and loss to check model performance.
8. **Prediction**: Use the model to classify new text as hate speech or non-hate speech.
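
As a rough illustration of the preprocessing step in the list above, the snippet below sketches one way the cleaning could look using only the standard library. The function name `clean_text` and the specific rules (lowercasing, dropping URLs, @mentions, and punctuation) are assumptions made for this example; the notebook's actual preprocessing may differ.

```python
import re
import string
from typing import List

# Hypothetical cleaning rules; the notebook's actual steps may differ.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> List[str]:
    """Lowercase, strip URLs, mentions and punctuation, then split on whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)      # remove links
    text = MENTION_PATTERN.sub(" ", text)  # remove @user handles
    text = text.translate(PUNCT_TABLE)     # remove ASCII punctuation
    return text.split()                    # simple whitespace tokenization


if __name__ == "__main__":
    print(clean_text("@user Check THIS out: https://example.com !!!"))
```

Whitespace splitting keeps the sketch dependency-free; a tokenizer from NLTK, which is listed among the project's libraries, could be swapped in at the last step.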

### 🚀 **Models Implemented**

The project uses an LSTM (Long Short-Term Memory) model with an embedding layer to detect hate speech. LSTM was chosen because it effectively captures the context and long-term dependencies in sequential text data, making it well-suited for understanding language patterns. The embedding layer helps convert words into dense vectors, enhancing the model's ability to grasp semantic relationships, while a final dense layer with a sigmoid activation performs binary classification of the text.
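
The architecture described above (embedding, LSTM, sigmoid output) could be sketched in Keras roughly as follows. This is a minimal sketch, not the notebook's model: the vocabulary size, embedding dimension, number of LSTM units, optimizer, and the helper name `build_model` are all placeholder assumptions.

```python
import numpy as np
import tensorflow as tf

# All sizes below are placeholder assumptions, not values from the notebook.
VOCAB_SIZE = 10_000  # number of distinct token ids
EMBED_DIM = 64       # length of each word vector
LSTM_UNITS = 64      # size of the LSTM hidden state


def build_model() -> tf.keras.Model:
    """Embedding -> LSTM -> Dense(sigmoid) binary classifier."""
    model = tf.keras.Sequential([
        # Maps each integer token id to a dense vector.
        tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
        # Reads the sequence and returns its final hidden state.
        tf.keras.layers.LSTM(LSTM_UNITS),
        # One sigmoid unit gives a probability between 0 and 1.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


if __name__ == "__main__":
    model = build_model()
    # Two dummy sequences of 20 token ids each, only to check the output shape.
    dummy_batch = np.zeros((2, 20), dtype="int32")
    print(np.asarray(model(dummy_batch)).shape)
```

The model expects integer token ids, so the text has to be tokenized and padded to a common length before training.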

### 📚 **Libraries Needed**

Here are all the libraries used in this project:

1. **NumPy**: For numerical operations and array handling.
2. **Pandas**: For data manipulation and analysis.
3. **Matplotlib**: For creating visualizations and plots.
4. **Seaborn**: For statistical data visualization.
5. **NLTK (Natural Language Toolkit)**: For text preprocessing tasks like tokenization and stopword removal.
6. **Scikit-learn**: For data splitting, metrics evaluation, and preprocessing utilities.
7. **TensorFlow/Keras**: For building and training the deep learning model.
8. **re (Regular Expressions)**: For text cleaning and preprocessing.
9. **string**: Python's built-in module for common string constants, used here in text processing.

### 📊 **Exploratory Data Analysis Results**
![model_deployment_01](https://github.com/user-attachments/assets/1c8cb248-9ff1-4dd3-af0f-f00e080854f9)
![model_deployment_02](https://github.com/user-attachments/assets/341dab93-3293-4f2e-9a8f-1464a2b4a57a)


### 📈 **Performance of the Models based on the Accuracy Scores**

The project uses an **LSTM (Long Short-Term Memory) network** as its main model. It achieved an accuracy of approximately **85%** on the test dataset, with balanced precision, recall, and F1-score, which suggests that it handles complex, context-dependent text reasonably well.
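
For reference, the metrics named above can be computed from the counts of a binary confusion matrix, as sketched below. The helper name `binary_metrics` is hypothetical, and the labels and predictions are made-up toy values, not results from this project.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels where 1 = hate speech."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(binary_metrics(y_true, y_pred))
```

Scikit-learn, which is already among the project's libraries, provides the same metrics through `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`.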


### 📢 **Conclusion**

Differentiating hate speech from offensive language is a challenging task. Our approach, which involves text pre-processing and feature extraction (e.g., n-gram TF-IDF, sentiment polarity, doc2vec, and readability scores), demonstrates the benefits of using these features for classification. The evaluation of models based on accuracy and F1-scores highlights the complexity of the problem. While the results show the potential of the proposed features, further analysis and error review could improve the feature extraction methods and help address remaining challenges in detecting toxic language on platforms like Twitter.

### ✒️ **Your Signature**

Adwitya Chakraborty
@@ -0,0 +1,83 @@
from flask import Flask, render_template, request, jsonify
from flask_wtf import FlaskForm
from wtforms import StringField, SubmitField
from wtforms.validators import DataRequired
import tensorflow as tf
import tensorflow_text  # prerequisite for using the BERT preprocessing layer
import numpy as np
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Create the Flask web application
app = Flask(__name__)

# Set a secret key (stored in .env) as a security measure (e.g. protecting against CSRF attacks)
app.config["SECRET_KEY"] = os.getenv("SECRET_KEY")

# Load the TensorFlow model
model = tf.keras.models.load_model("saved_models/model3")


# Create hate speech detection form class (that inherits from the Flask-WTF FlaskForm class)
class HateSpeechForm(FlaskForm):
    comment = StringField("Social Media Comment", validators=[DataRequired()])
    submit = SubmitField("Run")


# Home route
@app.route("/", methods=["GET", "POST"])
def home():
    # Instantiate a hate speech form class object
    form = HateSpeechForm()
    # If the user submitted valid information in the hate speech form
    if form.validate_on_submit():
        # Get the input text from the form
        input_text = form.comment.data
        # Convert input text to a list
        input_data = [input_text]
        # Make prediction using the TensorFlow model
        prediction_prob = model.predict(input_data)[0][0]
        # Convert prediction probability to percent
        prediction_prob = np.round(prediction_prob * 100, 1)
        # Convert prediction probability to prediction in text form
        if prediction_prob >= 50:
            prediction = "Hate Speech"
        else:
            prediction = "No Hate Speech"
            # Invert the probability so it reports confidence in the predicted class
            prediction_prob = 100 - prediction_prob
        # Render the prediction and prediction probability in the index.html template
        return render_template("index.html",
                               form=form,
                               prediction=prediction,
                               prediction_prob=prediction_prob)
    return render_template("index.html", form=form)


# API route
@app.route("/api")
def prediction_by_api():
    # Get the input text from the "comment" query parameter
    input_text = request.args.get("comment")
    # Convert input text to a list
    input_data = [input_text]
    # Make prediction using the TensorFlow model
    prediction_prob = model.predict(input_data)[0][0]
    # Convert prediction probability to prediction in text form
    if prediction_prob >= 0.5:
        prediction = "Hate Speech"
    else:
        prediction = "No Hate Speech"
        # Invert the probability so it reports confidence in the predicted class
        prediction_prob = 1 - prediction_prob
    # Return json with the prediction and prediction probability
    return jsonify({"prediction": prediction,
                    "probability": float(prediction_prob)})


# Start the Flask web application
if __name__ == "__main__":
    app.run(debug=True)
@@ -0,0 +1,14 @@
## Hate Speech Detection

### Goal 🎯
The goal of this project is to develop a deep learning model that identifies and classifies hate speech in text data, so that harmful language can be flagged and filtered, promoting safer and more respectful online interactions.

### Model(s) used for the Web App 🧮
The application uses a TensorFlow model with BERT for binary classification of hate speech vs. non-hate speech. The model outputs a probability between 0 and 1, with 0.5 as the decision threshold: values at or above 0.5 are labelled hate speech.
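
The thresholding rule can be written as a small pure function. The sketch below is illustrative: the name `label_from_probability` is hypothetical, and returning the confidence in the chosen label (1 minus the probability for the negative class) is an assumption about how the score is reported.

```python
from typing import Tuple


def label_from_probability(prob: float, threshold: float = 0.5) -> Tuple[str, float]:
    """Return (label, confidence in that label) for a hate-speech probability."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be between 0 and 1")
    if prob >= threshold:
        return "Hate Speech", prob
    # For the negative class, report 1 - prob so the score refers to the chosen label.
    return "No Hate Speech", 1.0 - prob


if __name__ == "__main__":
    print(label_from_probability(0.8))
    print(label_from_probability(0.25))
```

Keeping the rule in one function would let a web form and a JSON endpoint share the same threshold instead of repeating the comparison.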

### Video Demonstration 🎥
![model_deployment_api](https://github.com/user-attachments/assets/e89599e4-8271-4c65-aefd-17078c1fc9c9)


### Signature ✒️
Adwitya Chakraborty
@@ -0,0 +1,81 @@
absl-py==1.4.0
asttokens==2.2.1
astunparse==1.6.3
backcall==0.2.0
cachetools==5.3.1
certifi==2023.5.7
charset-normalizer==3.2.0
click==8.1.4
cloudpickle==2.2.1
colorama==0.4.6
comm==0.1.3
debugpy==1.6.7
decorator==5.1.1
executing==1.2.0
Flask==1.1.2
Flask-WTF==0.14.3
flatbuffers==23.5.26
gast==0.4.0
google-auth==2.22.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.56.0
gunicorn==20.1.0
h5py==3.8.0
idna==3.4
importlib-metadata==6.7.0
ipykernel==6.16.2
ipython==7.34.0
itsdangerous==2.0.1
jedi==0.18.2
Jinja2==3.0.0
jupyter_client==8.0.0a1
jupyter_core==5.0.0rc2
keras==2.10.0
Keras-Preprocessing==1.1.2
libclang==16.0.0
Markdown==3.4.3
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
nest-asyncio==1.5.6
numpy==1.21.6
oauthlib==3.2.2
opt-einsum==3.3.0
packaging==23.1
parso==0.8.3
pickleshare==0.7.5
platformdirs==3.8.1
prompt-toolkit==3.0.39
protobuf==3.19.6
psutil==5.9.5
pure-eval==0.2.2
pyasn1==0.5.0
pyasn1-modules==0.3.0
Pygments==2.15.1
python-dateutil==2.8.2
python-dotenv==0.21.1
pyzmq==25.1.0
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
six==1.16.0
spyder-kernels==2.2.0
stack-data==0.6.2
tensorboard==2.10.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.10.0
tensorflow-estimator==2.10.0
tensorflow-hub==0.13.0
tensorflow-io-gcs-filesystem==0.31.0
tensorflow-text==2.10.0
termcolor==2.3.0
tornado==6.2
traitlets==5.9.0
typing_extensions==4.7.1
urllib3==1.26.16
wcwidth==0.2.6
Werkzeug==2.0.3
wrapt==1.15.0
WTForms==2.3.3
zipp==3.15.0
@@ -0,0 +1,7 @@
numpy
pandas
matplotlib
seaborn
scikit-learn
nltk
tensorflow |