This repository contains my ideas, learnings, and code from my attempt at solving the Taxonomy Prediction Problem for TCS HumAIn 2019.
Taxonomy Creation - For the given content, come up with a solution to build the taxonomy.
- For a given question, we have to predict its tags from the text in the question's Title and Body. For example, a question titled "How do I merge two dictionaries in Python?" might be tagged python and dictionary.
Id - Unique identifier for each question.
Title - The question's title.
Body - The body of the question.
Tags - The tags associated with the question.
The Title and Body columns contain the text of the question, which is the input to the taxonomy prediction model.
The Tags column contains all the tags associated with the question's Title and Body.
Automatically predicting tags can be very useful for websites and their users, as manually tagging posts and questions is time-consuming and tedious.
However, we use only 500,000 examples, carefully sampled so that each example's tags are a subset of the 500 most frequent tags.
Instead of installing these requirements, it's recommended to use Google Colab to run the notebook.
To use the Python scripts, the requirements are as follows:
- Python 3.7
- Numpy
- Pandas
- Matplotlib
- Scikit-Learn
- Scipy
- NLTK
- Pickle
- Download and Install Python 3.7 from this link
- Open Terminal
- Run: pip install -r requirements.txt on Windows.
Use sudo pip install -r requirements.txt on Linux
Dataset is taken from Facebook Recruiting III - Keyword Extraction competition on Kaggle.
- Download data from: link (2.2 GB)
- Move the downloaded file to the data folder in the cloned repository.
- Extract Train.zip to obtain Train.csv
- Repeat steps 2 and 3 for Test data downloaded from this link (725 MB)
The code is available as both a Notebook and Python scripts; either can be used.
- Clone the Repository to your local machine or on Google Colab.
- Download data using steps specified above and make sure that Train.csv is in data/ directory
- If you use a local machine, create a new virtual environment and install all the requirements specified in requirements.txt using the steps above.
- Open Notebook in JupyterLab or in Google Colab.
- Read through it directly, or execute all cells.
- Clone the repository to your machine.
- Download data using steps specified above and make sure that Train.csv is in data/ directory
- Create a new Virtual Environment and install all the requirements specified in requirements.txt using steps specified above.
- Run each script in the following order:
- 1_data_sampling.py
- 2_data_cleaning.py
- 3_bag_of_words.py
- 4_data_vectorize.py
- 5_train_model.py
- 6_test_model.py
To test the model, we can use Test_new.npz that we created during the data sampling and cleaning steps.
- First, we have to train the model (either by running the notebook or by running scripts 1 to 5).
- Once the model is trained, we can test it using the last cell of the notebook or using the script 6_test_model.py (recommended).
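A minimal testing sketch, assuming Test_new.npz holds the vectorized test matrix produced by the sampling and cleaning steps, and that the trained model was pickled; the model's file name here is hypothetical:

```python
import pickle

from scipy.sparse import load_npz

# Load the vectorized test matrix created during sampling/cleaning.
X_test = load_npz("data/Test_new.npz")

# "model.pkl" is an assumed name; use whatever path the training step saved to.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Predict a binary tag-indicator matrix (one column per tag).
predictions = model.predict(X_test)
print(predictions.shape)
```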
I created this project for the Idea Submission round at TCS HumAIn 2019. The dataset and problem description were provided by the TCS team, so a big thanks to everyone at TCS.
- Data Exploration
- Data Cleaning and Data Engineering Part-1
- Data Engineering Part-2
- Tokenize + Remove Stop words + Stemming + Vectorizing
- Training
- Testing
- Check for all NaN values in data.
- Plot tags frequency.
- Check how tags are distributed among questions.
- Draw conclusions and strategize the next steps.
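A minimal exploration sketch in pandas covering the steps above, assuming Train.csv fits in memory (chunked reading may be needed for the full 2.2 GB file):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/Train.csv")

# Check for NaN values in each column.
print(df.isna().sum())

# Tags are space-separated strings; split and count frequencies.
tag_counts = df["Tags"].dropna().str.split().explode().value_counts()
tag_counts.head(50).plot(kind="bar", figsize=(12, 4),
                         title="Top 50 tag frequencies")
plt.show()

# Distribution of number of tags per question.
df["Tags"].dropna().str.split().str.len().value_counts().sort_index().plot(kind="bar")
plt.show()
```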
- Remove unnecessary features from data [Id].
- Drop all rows where Tags column is NaN.
- Select Most Frequent 500 tags.
- Create a list of top 500 tags.
- Find indices of examples whose tags are all a subset of the top 500 tags.
- Sample 500,000 indices from list of indices obtained from previous step.
- Sample the training set using those indices and save.
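A condensed pandas sketch of the sampling steps above; the output file name and random seed are assumptions:

```python
import pandas as pd

df = pd.read_csv("data/Train.csv").dropna(subset=["Tags"])
df = df.drop(columns=["Id"])  # Id carries no predictive signal

# The 500 most frequent tags.
top_tags = set(df["Tags"].str.split().explode().value_counts().head(500).index)

# Keep only questions whose tags all fall within the top 500.
mask = df["Tags"].str.split().apply(lambda tags: set(tags) <= top_tags)
eligible = df[mask]

# Sample 500,000 such questions and save.
sampled = eligible.sample(n=500_000, random_state=42)
sampled.to_csv("data/Train_sampled.csv", index=False)
```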
- Using regular expressions, clean all the titles in the Title column.
- Separate the code portion from the Body and put it into a Code column.
- Clean the Body column using regular expressions.
- Similarly, clean the Code column.
- Create a new DataFrame by joining the Title, Body, and Code columns, separated by spaces, into a single column.
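A minimal cleaning sketch, assuming the Body is HTML containing <code> blocks (as in the Kaggle dump); the exact regular expressions here are illustrative, not the repository's:

```python
import re

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
HTML_RE = re.compile(r"<[^>]+>")             # strip remaining HTML tags
JUNK_RE = re.compile(r"[^a-zA-Z0-9#+.\s]")   # keep chars from tags like c#, c++

def normalize(s):
    return JUNK_RE.sub(" ", s).lower().strip()

def clean_row(title, body):
    # Pull <code> blocks out of the HTML body into their own field.
    code = " ".join(CODE_RE.findall(body))
    body = HTML_RE.sub(" ", CODE_RE.sub(" ", body))
    return normalize(title), normalize(body), normalize(code)

title, body, code = clean_row("How to parse JSON?",
                              "<p>I tried</p><code>json.loads(s)</code>")
text = " ".join([title, body, code])  # single combined text column
```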
- Loop through all the 500,000 examples:
  - Tokenize the text.
  - Remove stop words from it.
  - Stem the remaining words.
  - Join the words again to form a single string.
- Save the new, modified dataset.
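A sketch of this preprocessing loop with NLTK; the choice of SnowballStemmer is an assumption (the repository may use a different stemmer):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def preprocess(text):
    # Tokenize, drop stop words, stem, and re-join into one string.
    tokens = word_tokenize(text.lower())
    kept = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(kept)

print(preprocess("Parsing JSON strings in Python is straightforward"))
```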
- Apply a binary Count Vectorizer on the Tags.
- Apply a Count Vectorizer on the Text.
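A sketch of both vectorizers, assuming df holds the cleaned data with Tags and Text columns (the column names and vectorizer parameters are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Binary vectorizer for tags: each question becomes a 0/1 row over 500 tags.
tag_vectorizer = CountVectorizer(binary=True, tokenizer=str.split,
                                 max_features=500)
y = tag_vectorizer.fit_transform(df["Tags"])

# Count vectorizer for the combined question text; parameters illustrative.
text_vectorizer = CountVectorizer(max_features=200_000, ngram_range=(1, 2))
X = text_vectorizer.fit_transform(df["Text"])
```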
- Train a Stochastic Gradient Descent model.
- Train a Support Vector Classifier.
- Train a Logistic Regression classifier.
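A minimal training sketch, taking X and y from the vectorization step above; wrapping each model in OneVsRestClassifier (one binary classifier per tag, since tag prediction is multi-label) and using LinearSVC as the SVC variant are assumptions, not necessarily the repository's exact setup:

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Hold out a validation split from the vectorized data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# One-vs-rest: train an independent binary classifier per tag.
models = {
    "sgd": OneVsRestClassifier(SGDClassifier(loss="hinge"), n_jobs=-1),
    "svc": OneVsRestClassifier(LinearSVC(), n_jobs=-1),
    "logreg": OneVsRestClassifier(LogisticRegression(max_iter=1000), n_jobs=-1),
}

for name, model in models.items():
    model.fit(X_train, y_train)
```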
- Test the different models on held-out data.
- Select the best-performing model.
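Continuing the sketch above, model selection might compare micro-averaged F1 scores on the validation split; the metric choice is an assumption, not confirmed by the repository:

```python
from sklearn.metrics import f1_score

# Micro-averaged F1 is a common metric for multi-label tag prediction.
scores = {}
for name, model in models.items():
    scores[name] = f1_score(y_val, model.predict(X_val), average="micro")

best = max(scores, key=scores.get)
print(scores, "-> best model:", best)
```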