-
-
Notifications
You must be signed in to change notification settings - Fork 352
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #677 from Karthik110505/main
Real or Fake Job Posting Prediction
- Loading branch information
Showing
22 changed files
with
169 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
# Real or Fake Job Posting Prediction | ||
|
||
## Dataset | ||
The dataset used in this project is sourced from Kaggle. | ||
|
||
### About the Dataset | ||
The dataset, named "Real or Fake Job Posting Prediction," contains information about job postings labeled as either real or fake. | ||
|
||
### Data Source | ||
Link to the dataset: [Real or Fake Job Posting Prediction | ||
](https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction) | ||
|
||
### Description | ||
The dataset includes various features such as employment type, required experience, education level, industry, and function, along with a label indicating whether the job posting is real or fake. | ||
|
||
## Project Aim | ||
The aim of the project is to detect fraudulent job postings. This classification task involves building and training deep learning models to accurately differentiate between real job postings and those that are fake. |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+32.9 KB
Real or Fake Job Posting Prediction/Images/Distribution_of_employment_type.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+21.9 KB
...r Fake Job Posting Prediction/Images/Distribution_of_fraudulent_job_posting.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+112 KB
Real or Fake Job Posting Prediction/Images/Distribution_of_function.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+25.2 KB
Real or Fake Job Posting Prediction/Images/Distribution_of_has_company_logo.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+21.3 KB
Real or Fake Job Posting Prediction/Images/Distribution_of_has_questions.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+69.8 KB
Real or Fake Job Posting Prediction/Images/Distribution_of_required_education.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+41.9 KB
Real or Fake Job Posting Prediction/Images/Distribution_of_required_experience.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added
BIN
+24.7 KB
Real or Fake Job Posting Prediction/Images/Distribution_of_telecommunicating.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,141 @@ | ||
# Real or Fake Job Posting Prediction | ||
|
||
**GOAL** | ||
|
||
The aim of this project is to detect fraudulent job postings using machine learning models. | ||
|
||
**DATASET** | ||
|
||
[Real or Fake Job Posting Prediction | ||
](https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction) | ||
|
||
**DESCRIPTION** | ||
|
||
The dataset consists of job postings labeled as either real or fake. It includes various features such as job title, location, company profile, description, requirements, benefits, and more. | ||
|
||
**WHAT I DID** | ||
|
||
I conducted extensive data preprocessing tasks, including text cleaning, tokenization, and feature extraction using Word2Vec. The dataset was split into training and testing sets, and ANN , LSTM , CNN and BiLSTM models were constructed for classification. | ||
|
||
**MODELS USED** | ||
|
||
The models developed are: | ||
|
||
1. Artificial Neural Network (ANN) | ||
2. Long Short-Term Memory (LSTM) | ||
3. Convolutional Neural Network (CNN) | ||
4. Bidirectional Long Short-Term Memory (BiLSTM) | ||
|
||
|
||
**LIBRARIES NEEDED** | ||
|
||
- pandas | ||
- numpy | ||
- matplotlib | ||
- seaborn | ||
- re | ||
- nltk | ||
- gensim | ||
- keras | ||
- imblearn | ||
- wordcloud | ||
|
||
**Exploratory Data Analysis (EDA)** | ||
|
||
### Distribution of Employment Type | ||
![Distribution of Employment Type](../Images/Distribution_of_employment_type.png) | ||
|
||
### Distribution of Required Education | ||
![Distribution of Required Education](../Images/Distribution_of_required_education.png) | ||
|
||
### Distribution of Job Function | ||
![Distribution of Job Function](../Images/Distribution_of_function.png) | ||
|
||
### Distribution of Required Experience | ||
![Distribution of Required Experience](../Images/Distribution_of_required_experience.png) | ||
|
||
### Distribution of Telecommuting | ||
![Distribution of Telecommuting](../Images/Distribution_of_telecommunicating.png) | ||
|
||
### Distribution of Has Company Logo | ||
![Distribution of Has Company Logo](../Images/Distribution_of_has_company_logo.png) | ||
|
||
### Distribution of Has Questions | ||
![Distribution of Has Questions](../Images/Distribution_of_has_questions.png) | ||
|
||
### Distribution of Fraudulent Job Postings | ||
![Distribution of Fraudulent Job Postings](../Images/Distribution_of_fraudulent_job_posting.png) | ||
|
||
### Word Cloud for Real Job Postings | ||
![Word Cloud for Real Job Postings](../Images/word_cloud_real.png) | ||
|
||
### Word Cloud for Fake Job Postings | ||
![Word Cloud for Fake Job Postings](../Images/word_cloud_fake.png) | ||
|
||
### Conclusions from EDA | ||
|
||
#### Imbalanced Dataset | ||
- The dataset has a significant imbalance between real and fake job postings, which could impact model performance and requires handling (e.g., through resampling techniques like SMOTE). | ||
|
||
#### Distinct Patterns in Fake Job Postings | ||
Fake job postings tend to have: | ||
- Less specific employment types and required experiences. | ||
- Lower educational requirements. | ||
- Higher likelihood of lacking a company logo. | ||
- Different word usage patterns in job descriptions, often focusing on more generic terms. | ||
|
||
#### Feature Importance for Classification | ||
- The presence of a company logo and specific job functions/experience/education levels are important indicators of job posting authenticity. | ||
- Text analysis of job descriptions reveals distinct patterns in fake job postings that can be leveraged for classification. | ||
|
||
**Model Training** | ||
|
||
### ANN Model Training Performance | ||
![ANN Model Training Performance](../Images/ANN.png) | ||
|
||
### LSTM Model Training Performance | ||
![LSTM Model Training Performance](../Images/LSTM.png) | ||
|
||
### CNN Model Training Performance | ||
![CNN Model Training Performance](../Images/CNN.png) | ||
|
||
### BiLSTM Model Training Performance | ||
![BiLSTM Model Training Performance](../Images/BiLSTM.png) | ||
|
||
**Confusion Matrix** | ||
|
||
### ANN Model Confusion Matrix | ||
![ANN Model Confusion Matrix](../Images/ann_cm.png) | ||
|
||
### LSTM Model Confusion Matrix | ||
![LSTM Model Confusion Matrix](../Images/lstm_cm.png) | ||
|
||
### CNN Model Confusion Matrix | ||
![CNN Model Confusion Matrix](../Images/CNN_cm.png) | ||
|
||
### BiLSTM Model Confusion Matrix | ||
![BiLSTM Model Confusion Matrix](../Images/BiLSTM_cm.png) | ||
|
||
|
||
|
||
|
||
|
||
|
||
### Overall Findings | ||
- **Imbalanced Dataset:** The dataset had a significant imbalance between real and fake job postings, which was addressed using SMOTE. | ||
- **Feature Importance:** Features such as the presence of a company logo, specific job functions, experience, and education levels were important indicators of job posting authenticity. | ||
- **Text Analysis:** Text analysis revealed distinct patterns in fake job postings, which were leveraged for classification. | ||
- **Model Comparison:** All models performed exceptionally well, with the BiLSTM model slightly outperforming the others. This can be attributed to its ability to capture bidirectional sequential dependencies in the data. | ||
- **Accuracy:** The BiLSTM model achieved the highest accuracy at 98.82%, followed by the LSTM model at 98.77%, CNN at 98.03%, and ANN at 97.96%. | ||
- **Loss:** The BiLSTM model had the lowest loss at 0.0487, indicating better model convergence. | ||
- **Classification Report:** All models showed high precision, recall, and F1-scores, indicating effective classification of both classes. | ||
|
||
**CONCLUSION** | ||
|
||
All models provide a robust foundation for detecting fraudulent job postings, and the insights gained from EDA can be used to further refine and improve model performance. While all models demonstrated strong performance in detecting patterns within the data, the BiLSTM model's superior ability to capture and leverage bidirectional dependencies resulted in the highest accuracy and best overall performance. This suggests that for this particular dataset and task, models that can capture sequential patterns in both forward and backward directions offer a significant advantage. | ||
|
||
|
||
**Connect with Me** | ||
|
||
- [LinkedIn](https://www.linkedin.com/in/barrenkala-veera-venkata-karthik-b58b9a285/) | ||
- [GitHub](https://github.com/Karthik110505) |
1 change: 1 addition & 0 deletions
1
Real or Fake Job Posting Prediction/Model/real-fake-job-posting-prediction.ipynb
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
pandas | ||
numpy | ||
matplotlib | ||
seaborn | ||
re | ||
nltk | ||
gensim | ||
keras | ||
imblearn | ||
wordcloud |