Merge pull request #677 from Karthik110505/main

Real or Fake Job Posting Prediction
abhisheks008 · Jun 9, 2024 · cc3e2ec · cc3e2ec
2 parents 00fad73 + 2a51fdd
commit cc3e2ec
Show file tree

Hide file tree

Showing 22 changed files with 169 additions and 0 deletions.
diff --git a/Real or Fake Job Posting Prediction/Dataset/README.md b/Real or Fake Job Posting Prediction/Dataset/README.md
@@ -0,0 +1,17 @@
+# Real or Fake Job Posting Prediction
+
+## Dataset
+The dataset used in this project is sourced from Kaggle.
+
+### About the Dataset
+The dataset, named "Real or Fake Job Posting Prediction," contains information about job postings labeled as either real or fake.
+
+### Data Source
+Link to the dataset: [Real or Fake Job Posting Prediction
+](https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction)
+
+### Description
+The dataset includes various features such as employment type, required experience, education level, industry, and function, along with a label indicating whether the job posting is real or fake.
+
+## Project Aim
+The aim of the project is to detect fraudulent job postings. This classification task involves building and training deep learning models to accurately differentiate between real job postings and those that are fake.
diff --git a/Real or Fake Job Posting Prediction/Images/ANN.png b/Real or Fake Job Posting Prediction/Images/ANN.png
diff --git a/Real or Fake Job Posting Prediction/Images/BiLSTM.png b/Real or Fake Job Posting Prediction/Images/BiLSTM.png
diff --git a/Real or Fake Job Posting Prediction/Images/BiLSTM_cm.png b/Real or Fake Job Posting Prediction/Images/BiLSTM_cm.png
diff --git a/Real or Fake Job Posting Prediction/Images/CNN.png b/Real or Fake Job Posting Prediction/Images/CNN.png
diff --git a/Real or Fake Job Posting Prediction/Images/CNN_cm.png b/Real or Fake Job Posting Prediction/Images/CNN_cm.png
diff --git a/Real or Fake Job Posting Prediction/Images/Distribution_of_employment_type.png b/Real or Fake Job Posting Prediction/Images/Distribution_of_employment_type.png
diff --git a/...r Fake Job Posting Prediction/Images/Distribution_of_fraudulent_job_posting.png b/...r Fake Job Posting Prediction/Images/Distribution_of_fraudulent_job_posting.png
diff --git a/Real or Fake Job Posting Prediction/Images/Distribution_of_function.png b/Real or Fake Job Posting Prediction/Images/Distribution_of_function.png
diff --git a/Real or Fake Job Posting Prediction/Images/Distribution_of_has_company_logo.png b/Real or Fake Job Posting Prediction/Images/Distribution_of_has_company_logo.png
diff --git a/Real or Fake Job Posting Prediction/Images/Distribution_of_has_questions.png b/Real or Fake Job Posting Prediction/Images/Distribution_of_has_questions.png
diff --git a/Real or Fake Job Posting Prediction/Images/Distribution_of_required_education.png b/Real or Fake Job Posting Prediction/Images/Distribution_of_required_education.png
diff --git a/Real or Fake Job Posting Prediction/Images/Distribution_of_required_experience.png b/Real or Fake Job Posting Prediction/Images/Distribution_of_required_experience.png
diff --git a/Real or Fake Job Posting Prediction/Images/Distribution_of_telecommunicating.png b/Real or Fake Job Posting Prediction/Images/Distribution_of_telecommunicating.png
diff --git a/Real or Fake Job Posting Prediction/Images/LSTM.png b/Real or Fake Job Posting Prediction/Images/LSTM.png
diff --git a/Real or Fake Job Posting Prediction/Images/ann_cm.png b/Real or Fake Job Posting Prediction/Images/ann_cm.png
diff --git a/Real or Fake Job Posting Prediction/Images/lstm_cm.png b/Real or Fake Job Posting Prediction/Images/lstm_cm.png
diff --git a/Real or Fake Job Posting Prediction/Images/word_cloud_fake.png b/Real or Fake Job Posting Prediction/Images/word_cloud_fake.png
diff --git a/Real or Fake Job Posting Prediction/Images/word_cloud_real.png b/Real or Fake Job Posting Prediction/Images/word_cloud_real.png
diff --git a/Real or Fake Job Posting Prediction/Model/README.md b/Real or Fake Job Posting Prediction/Model/README.md
@@ -0,0 +1,141 @@
+# Real or Fake Job Posting Prediction
+
+**GOAL**
+
+The aim of this project is to detect fraudulent job postings using machine learning models.
+
+**DATASET**
+
+[Real or Fake Job Posting Prediction
+](https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction)
+
+**DESCRIPTION**
+
+The dataset consists of job postings labeled as either real or fake. It includes various features such as job title, location, company profile, description, requirements, benefits, and more.
+
+**WHAT I DID**
+
+I conducted extensive data preprocessing tasks, including text cleaning, tokenization, and feature extraction using Word2Vec. The dataset was split into training and testing sets, and  ANN , LSTM , CNN and BiLSTM models were constructed for classification.
+
+**MODELS USED**
+
+The models developed are:
+
+1. Artificial Neural Network (ANN)
+2. Long Short-Term Memory (LSTM)
+3. Convolutional Neural Network (CNN)
+4. Bidirectional Long Short-Term Memory (BiLSTM)
+
+
+**LIBRARIES NEEDED**
+
+- pandas
+- numpy
+- matplotlib
+- seaborn
+- re
+- nltk
+- gensim
+- keras
+- imblearn
+- wordcloud
+
+**Exploratory Data Analysis (EDA)**
+
+### Distribution of Employment Type
+![Distribution of Employment Type](../Images/Distribution_of_employment_type.png)
+
+### Distribution of Required Education
+![Distribution of Required Education](../Images/Distribution_of_required_education.png)
+
+### Distribution of Job Function
+![Distribution of Job Function](../Images/Distribution_of_function.png)
+
+### Distribution of Required Experience
+![Distribution of Required Experience](../Images/Distribution_of_required_experience.png)
+
+### Distribution of Telecommuting
+![Distribution of Telecommuting](../Images/Distribution_of_telecommunicating.png)
+
+### Distribution of Has Company Logo
+![Distribution of Has Company Logo](../Images/Distribution_of_has_company_logo.png)
+
+### Distribution of Has Questions
+![Distribution of Has Questions](../Images/Distribution_of_has_questions.png)
+
+### Distribution of Fraudulent Job Postings
+![Distribution of Fraudulent Job Postings](../Images/Distribution_of_fraudulent_job_posting.png)
+
+### Word Cloud for Real Job Postings
+![Word Cloud for Real Job Postings](../Images/word_cloud_real.png)
+
+### Word Cloud for Fake Job Postings
+![Word Cloud for Fake Job Postings](../Images/word_cloud_fake.png)
+
+### Conclusions from EDA
+
+#### Imbalanced Dataset
+- The dataset has a significant imbalance between real and fake job postings, which could impact model performance and requires handling (e.g., through resampling techniques like SMOTE).
+
+#### Distinct Patterns in Fake Job Postings
+Fake job postings tend to have:
+- Less specific employment types and required experiences.
+- Lower educational requirements.
+- Higher likelihood of lacking a company logo.
+- Different word usage patterns in job descriptions, often focusing on more generic terms.
+
+#### Feature Importance for Classification
+- The presence of a company logo and specific job functions/experience/education levels are important indicators of job posting authenticity.
+- Text analysis of job descriptions reveals distinct patterns in fake job postings that can be leveraged for classification.
+
+**Model Training**
+
+### ANN Model Training Performance
+![ANN Model Training Performance](../Images/ANN.png)
+
+### LSTM Model Training Performance
+![LSTM Model Training Performance](../Images/LSTM.png)
+
+### CNN Model Training Performance
+![CNN Model Training Performance](../Images/CNN.png)
+
+### BiLSTM Model Training Performance
+![BiLSTM Model Training Performance](../Images/BiLSTM.png)
+
+**Confusion Matrix**
+
+### ANN Model Confusion Matrix
+![ANN Model Confusion Matrix](../Images/ann_cm.png)
+
+### LSTM Model Confusion Matrix
+![LSTM Model Confusion Matrix](../Images/lstm_cm.png)
+
+### CNN Model Confusion Matrix
+![CNN Model Confusion Matrix](../Images/CNN_cm.png)
+
+### BiLSTM Model Confusion Matrix
+![BiLSTM Model Confusion Matrix](../Images/BiLSTM_cm.png)
+
+
+
+
+
+
+### Overall Findings
+- **Imbalanced Dataset:** The dataset had a significant imbalance between real and fake job postings, which was addressed using SMOTE.
+- **Feature Importance:** Features such as the presence of a company logo, specific job functions, experience, and education levels were important indicators of job posting authenticity.
+- **Text Analysis:** Text analysis revealed distinct patterns in fake job postings, which were leveraged for classification.
+- **Model Comparison:** All models performed exceptionally well, with the BiLSTM model slightly outperforming the others. This can be attributed to its ability to capture bidirectional sequential dependencies in the data.
+- **Accuracy:** The BiLSTM model achieved the highest accuracy at 98.82%, followed by the LSTM model at 98.77%, CNN at 98.03%, and ANN at 97.96%.
+- **Loss:** The BiLSTM model had the lowest loss at 0.0487, indicating better model convergence.
+- **Classification Report:** All models showed high precision, recall, and F1-scores, indicating effective classification of both classes.
+
+**CONCLUSION**
+
+All models provide a robust foundation for detecting fraudulent job postings, and the insights gained from EDA can be used to further refine and improve model performance. While all models demonstrated strong performance in detecting patterns within the data, the BiLSTM model's superior ability to capture and leverage bidirectional dependencies resulted in the highest accuracy and best overall performance. This suggests that for this particular dataset and task, models that can capture sequential patterns in both forward and backward directions offer a significant advantage.
+
+
+**Connect with Me**
+
+- [LinkedIn](https://www.linkedin.com/in/barrenkala-veera-venkata-karthik-b58b9a285/)
+- [GitHub](https://github.com/Karthik110505)
diff --git a/Real or Fake Job Posting Prediction/Model/real-fake-job-posting-prediction.ipynb b/Real or Fake Job Posting Prediction/Model/real-fake-job-posting-prediction.ipynb
diff --git a/Real or Fake Job Posting Prediction/requirements.txt b/Real or Fake Job Posting Prediction/requirements.txt
@@ -0,0 +1,10 @@
+pandas
+numpy
+matplotlib
+seaborn
+re
+nltk
+gensim
+keras
+imblearn
+wordcloud