Spam Vs Ham Mail Classification [With Streamlit GUI] #329 #491

Closed · wants to merge 6 commits
74 changes: 74 additions & 0 deletions Shoe vs Sandal vs Boot Image Classification/Model/README.md
@@ -0,0 +1,74 @@
# Shoe, Sandal, and Boot Image Classification Project<br>

## 🎯 Goal<br>
The main goal of this project is to develop an image classification system capable of distinguishing between shoes, sandals, and boots. The purpose is to create models that can accurately categorize footwear images into predefined classes, aiding in retail and fashion industry applications.<br>

## 🧵 Dataset<br>
The dataset used for this project consists of a collection of footwear images sourced from [Dataset Link](https://www.kaggle.com/datasets/hasibalmuzdadid/shoe-vs-sandal-vs-boot-dataset-15k-images). The dataset is curated to include various styles of shoes, sandals, and boots with labeled classes.<br>

## 🧾 Description<br>
This project focuses on using machine learning techniques to build robust image classification models for footwear. The developed models aim to identify different types of footwear based on visual features extracted from images.<br>

## 🧮 What I have done!<br>
1. **Data Preprocessing:**
- Resized images to a standard input size for model compatibility.
- Normalized pixel values to the range [0, 1].

2. **Model Architectures:**
- Custom Convolutional Neural Network (CNN): Designed for specific features of footwear images.
- Transfer Learning: Utilized the pre-trained VGG16 architecture for enhanced performance.

3. **Data Augmentation:**
   - Applied image data augmentation techniques using TensorFlow's ImageDataGenerator to improve model generalization (see the combined sketch after this list).

4. **Model Training:**
- Trained the custom CNN and VGG16 models on the preprocessed and augmented dataset.
- Evaluated models on a validation set to monitor performance.

5. **Save Models:**
- Saved the trained models for future use and predictions.
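
A minimal sketch of the preprocessing, augmentation, and VGG16 transfer-learning steps listed above (the directory name, image size, and hyperparameters are illustrative assumptions, not taken from the notebooks):

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (150, 150)   # assumed input size
BATCH_SIZE = 32

# Augment and rescale the images; hold out 20% for validation.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.2,
)

train_data = datagen.flow_from_directory(
    "Shoe vs Sandal vs Boot Dataset",   # assumed dataset directory
    target_size=IMG_SIZE, batch_size=BATCH_SIZE,
    class_mode="categorical", subset="training",
)
val_data = datagen.flow_from_directory(
    "Shoe vs Sandal vs Boot Dataset",
    target_size=IMG_SIZE, batch_size=BATCH_SIZE,
    class_mode="categorical", subset="validation",
)

# Transfer learning: frozen VGG16 base plus a small classification head.
base = tf.keras.applications.VGG16(
    include_top=False, weights="imagenet", input_shape=IMG_SIZE + (3,)
)
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # shoe / sandal / boot
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_data, validation_data=val_data, epochs=10)
model.save("vgg16_footwear_classifier.h5")
```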

## 🚀 Models Implemented<br>
- Custom Convolutional Neural Network (CNN)
- VGG16-based Convolutional Neural Network

**Why these models:**
- Custom CNN: Designed for dataset-specific features.
- VGG16: Known for simplicity and effectiveness in image classification tasks.

## 📚 Libraries Needed
1. TensorFlow
2. Matplotlib
3. Numpy
4. Scikit-learn (for additional evaluation metrics if required)

## 📊 Exploratory Data Analysis Results


**Distribution of Classes**
To gain an understanding of the dataset, we analyzed the distribution of images across different classes.

| Class | Number of Images |
|---------|-------------------|
| Shoe | 5000 |
| Sandal | 5000 |
| Boot | 5000 |

| Model | Accuracy |
|---------|-------------------|
| Custom CNN | 78% |
| VGG 16 | 98% |

*Insight*: VGG16 achieves the higher accuracy of the two models.<br>

## 📈 Performance of the Models based on Accuracy Scores<br>
- Custom CNN Model Accuracy: 78%
- VGG16 Model Accuracy: 98%

## 📢 Conclusion<br>
The footwear image classification project demonstrates effective learning and categorization across different types of footwear. The models show promising accuracy, with the VGG16-based model achieving the highest accuracy of 98%. These results indicate the potential of the models for real-world applications in footwear recognition.<br>

## ✒️ Your Signature<br>
Dipayan Majumder<br>
[dipayan22](https://github.com/dipayan22/)<br>

386 changes: 386 additions & 0 deletions Shoe vs Sandal vs Boot Image Classification/Model/model1.ipynb

Large diffs are not rendered by default.

269 changes: 269 additions & 0 deletions Shoe vs Sandal vs Boot Image Classification/Model/model2.ipynb

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions Shoe vs Sandal vs Boot Image Classification/requirements.txt
@@ -0,0 +1,4 @@
tensorflow
matplotlib
numpy
scikit-learn
5,172 changes: 5,172 additions & 0 deletions Spam Vs Ham Mail Classification [With Streamlit GUI]/Dataset/newData.csv

Large diffs are not rendered by default.

Large diffs are not rendered by default.

@@ -0,0 +1,98 @@
# SPAM vs HAM Email Classification<br>

## 🎯 Goal<br>

The main goal of this project is to develop a robust deep learning model for classifying emails as either spam (SPAM) or legitimate (HAM). Additionally, the project aims to create a Streamlit GUI for a user-friendly interface in real-time email classification.<br>

## 🧵 Dataset<br>

The dataset used for this project is available [here](https://www.kaggle.com/datasets/omokennanna/simple-spam-classification). It consists of two columns: 'text' containing the email content and 'label' indicating whether the email is spam or ham.<br>

## 🧾 Description<br>

This project utilizes a deep learning model with an embedding layer, bidirectional LSTM layers, and a dense output layer. The model is trained on the provided dataset to classify emails as spam or ham. A Streamlit GUI is implemented to enable users to perform real-time email classification.<br>

## 🧮 What I have done!<br>

### 1. Data Preparation<br>

- The dataset is loaded and split into training and testing sets.<br>
- Missing values are handled, and the text data is preprocessed.<br>
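
A minimal sketch of this step, assuming the CSV has the 'text' and 'label' columns described above (the file path and split ratio are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (path assumed, based on the Dataset folder added in this PR).
df = pd.read_csv("Dataset/newData.csv")

# Drop rows with missing text or label, and encode labels as integers.
df = df.dropna(subset=["text", "label"])
df["label"] = df["label"].map({"ham": 0, "spam": 1})

# Hold out 20% of the data for testing (ratio assumed).
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)
```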

### 2. Model Architecture<br>

- For the machine learning approach, several algorithms were compared, and Bernoulli Naive Bayes was selected for its accuracy.<br>

- One deep learning model comprises an embedding layer, LSTM layers, and a dense output layer with a sigmoid activation function.<br>

- A second deep learning model replaces the LSTM layers with bidirectional LSTM layers, keeping the same embedding and sigmoid output layers.<br>
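
A minimal sketch of the bidirectional-LSTM variant (vocabulary size, embedding dimension, layer sizes, and sequence length are illustrative assumptions):

```python
import tensorflow as tf

VOCAB_SIZE = 10000   # assumed tokenizer vocabulary size
MAX_LENGTH = 100     # assumed padded sequence length

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary spam/ham output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```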

### 3. Training the Model<br>

The model is trained on the preprocessed dataset using the Adam optimizer and binary crossentropy loss. The training process is monitored for convergence and effectiveness.<br>
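
A minimal sketch of tokenization, padding, and training, continuing from the data-preparation and architecture sketches above (it reuses `X_train`, `y_train`, `VOCAB_SIZE`, `MAX_LENGTH`, and `model`; the epoch count and batch size are assumptions):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Fit the tokenizer on the training texts only, then convert to padded sequences.
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train),
                          maxlen=MAX_LENGTH, padding="post", truncating="post")
test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test),
                         maxlen=MAX_LENGTH, padding="post", truncating="post")

# Train with Adam and binary crossentropy, monitoring the held-out split.
history = model.fit(train_seq, y_train,
                    validation_data=(test_seq, y_test),
                    epochs=5, batch_size=32)
```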

### 4. Streamlit GUI<br>

A Streamlit GUI is implemented for real-time email classification. Users can input an email, and the model predicts whether it is spam or ham.<br>

## 🚀 Models Implemented<br>

1. Machine Learning Model
2. Deep Learning Model with LSTM Layers
3. Deep Learning Model with Bidirectional LSTM Layers

**Why these models:**<br>

1. **Machine Learning Model:**<br>

This traditional machine learning model serves as a baseline and allows us to compare the performance of deep learning models against a more conventional approach.<br>

2. **Deep Learning Model with LSTM Layers:**<br>

LSTM layers are particularly effective for sequential data, making them suitable for capturing long-range dependencies and patterns within the input data.<br>

3. **Deep Learning Model with Bidirectional LSTM Layers:**<br>

Bidirectional LSTM layers enhance the LSTM model by processing sequences in both forward and backward directions, allowing the model to capture information from past and future time steps simultaneously.<br>

## 📚 Libraries Needed<br>

1. TensorFlow
2. scikit-learn
3. pandas
4. matplotlib
5. seaborn
6. streamlit

## 📊 Exploratory Data Analysis Results<br>

### Insight<br>

The dataset contains 88% ham and 12% spam emails. This class imbalance makes accurate classification more challenging.<br>

![Spam vs Ham dataset](./../Image/Spam-vs-ham-piechart.jpg)

![Pairplot of Dataset](./../Image/PairPlot_withHue.png)
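
A minimal sketch of how the class distribution can be inspected, assuming the raw 'label' column with 'ham'/'spam' values (the CSV path is an assumption):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Dataset/newData.csv")

# Proportion of each class (roughly 88% ham / 12% spam, per the insight above).
counts = df["label"].value_counts(normalize=True)
print(counts)

counts.plot(kind="pie", autopct="%.1f%%")
plt.ylabel("")
plt.title("Spam vs Ham distribution")
plt.show()
```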



| Model | Accuracy Score |
| ---------------------------------- | -------------- |
| Machine Learning Model (BernoulliNB)| 96% |
| Deep Learning Model (LSTM) | 88.58% |
| Deep Learning Model (Bidirectional LSTM)| 98.56% |

## 📈 Performance of the Models based on the Accuracy Scores<br>

1. Machine Learning Model (BernoulliNB) : 96%
2. Deep Learning Model (LSTM) : 88.58%
3. Deep Learning Model (Bidirectional LSTM) : 98.56%

## 📢 Conclusion<br>

The SPAM vs HAM Email Classification project, coupled with the Streamlit GUI, provides an effective solution for real-time email categorization. The deep learning model demonstrates promising accuracy, and the user-friendly interface makes it accessible for practical use.<br>

## ✒️ Your Signature<br>

Dipayan Majumder<br>
[GitHub: dipayan22](https://github.com/dipayan22)
61 changes: 61 additions & 0 deletions Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/app1.py
@@ -0,0 +1,61 @@
# This Streamlit App is for the Machine Learning model.


import streamlit as st
from nltk.stem import WordNetLemmatizer
import pickle
import nltk
import string
from nltk.corpus import stopwords

# Download the NLTK resources needed by word_tokenize, the stopword list,
# and the WordNet lemmatizer (no-ops if they are already installed).
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Data preprocessing: lowercase, tokenize, keep alphanumeric tokens,
# drop stopwords and punctuation, then lemmatize.
def transform_text(text):
    text = text.lower()
    text = nltk.word_tokenize(text)

    # Keep only alphanumeric tokens.
    y = []
    for i in text:
        if i.isalnum():
            y.append(i)

    text = y[:]
    y.clear()

    # Remove stopwords and punctuation.
    for i in text:
        if i not in stopwords.words('english') and i not in string.punctuation:
            y.append(i)

    text = y[:]
    y.clear()

    # Lemmatize each remaining token.
    for i in text:
        y.append(lemmatizer.lemmatize(i))

    return " ".join(y)


# Load the fitted TF-IDF vectorizer and BernoulliNB model saved during training.
# The .pkl files are expected alongside this script; adjust the paths if they
# live in a different folder.
tfidf = pickle.load(open('vectorizer.pkl', 'rb'))
model = pickle.load(open('bnb.pkl', 'rb'))

st.title('SMS Spam Classification')

sms_input = st.text_area("Enter the text")

if st.button('Predict'):
    # Preprocess, vectorize, and classify the input text.
    transform_sms = transform_text(sms_input)
    vector_input = tfidf.transform([transform_sms])
    result = model.predict(vector_input)[0]

    if result == 1:
        st.title("SMS is Spam")
    else:
        st.title("SMS is not Spam")
40 changes: 40 additions & 0 deletions Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/app2.py
@@ -0,0 +1,40 @@
# This Streamlit GUI is used for the Deep Learning Model.

import streamlit as st
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import pickle

# Load the trained model (placeholder path; point it at the saved model).
model2 = tf.keras.models.load_model('path/to/your/trained/model')

# Load the tokenizer that was fitted on the training data.
# NOTE: re-creating a fresh Tokenizer on a placeholder word list would produce
# token indices the model has never seen, so load the tokenizer saved during
# training instead (placeholder path below).
with open('path/to/your/tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

# Maximum sequence length used during training (adjust to match your model).
max_length = 100

# Streamlit App
def main():
    st.title("SPAM vs HAM Email Classification")

    # User input
    user_input = st.text_area("Enter the email text:")

    if st.button("Predict"):
        # Tokenize and pad the input text the same way as during training.
        input_sequence = tokenizer.texts_to_sequences([user_input])
        padded_input = pad_sequences(input_sequence, maxlen=max_length,
                                     padding='post', truncating='post')

        # Make the prediction.
        prediction = model2.predict(padded_input)

        # Display the result.
        # NOTE: this assumes HAM was encoded as the positive class (output > 0.5);
        # if spam was encoded as 1 during training (as in app1.py), swap the branches.
        if prediction[0][0] > 0.5:
            st.success("Prediction: HAM (Legitimate Email)")
        else:
            st.error("Prediction: SPAM")

if __name__ == '__main__':
    main()
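
For context, a minimal sketch of how the model and tokenizer loaded above are assumed to have been saved at the end of training (the paths are placeholders matching the loads above):

```python
import pickle

# After training the (bidirectional) LSTM model described in the README:
model.save('path/to/your/trained/model')   # SavedModel directory (or a .h5/.keras file)

with open('path/to/your/tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)
```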