This repository contains my ideas, learnings, and code from my attempt at solving the Taxonomy Prediction Problem for TCS HumAIn 2019.
Taxonomy Creation - For the given content, come up with a solution to build the taxonomy.
- For a given question, we have to predict its tags from the text in the question's Title and Body. For example, a question titled "How do I merge two dictionaries in Python?" might be tagged python and dictionary.
Id - Unique identifier for each question.
Title - The question's title.
Body - The body of the question.
Tags - The tags associated with the question.
The Title and Body columns contain the text of the question, which is the input to the taxonomy prediction model.
The Tags column contains all the tags associated with the question's Title and Body.
Automatically predicting tags can be very useful for websites and their users, as manually tagging posts and questions is time-consuming and tedious.
However, we use only 500,000 examples, carefully sampled so that each example's tags are a subset of the 500 most frequent tags.
Instead of installing these requirements, it's recommended to use Google Colab to run the notebook.
To use the Python scripts, the requirements are as follows:
- Python 3.7
- Numpy
- Pandas
- Matplotlib
- Scikit-Learn
- Scipy
- NLTK
- Pickle
- Download and Install Python 3.7 from this link
- Open Terminal
- Run: pip install -r requirements.txt on Windows.
Use sudo pip install -r requirements.txt on Linux
Dataset is taken from Facebook Recruiting III - Keyword Extraction competition on Kaggle.
- Download data from: link (2.2 GB)
- Move the downloaded file to the data folder in the cloned repository.
- Extract Train.zip to obtain Train.csv
- Repeat steps 2 and 3 for Test data downloaded from this link (725 MB)
The code is available as both a Notebook and Python scripts; either can be used.
- Clone the Repository to your local machine or on Google Colab.
- Download data using steps specified above and make sure that Train.csv is in data/ directory
- If you use a local machine, create a new virtual environment and install all the requirements specified in requirements.txt using the steps above.
- Open Notebook in JupyterLab or in Google Colab.
- Read through it directly, or execute all cells.
- Clone the repository to your machine.
- Download data using steps specified above and make sure that Train.csv is in data/ directory
- Create a new Virtual Environment and install all the requirements specified in requirements.txt using steps specified above.
- Run each script in the following order:
- 1_data_sampling.py
- 2_data_cleaning.py
- 3_bag_of_words.py
- 4_data_vectorize.py
- 5_train_model.py
- 6_test_model.py
To test the model, we can use Test_new.npz that we created during the data sampling and cleaning steps.
- First, we have to train the model (either by running the notebook or by running scripts 1 to 5).
- Once the model is trained, we can test it using the last cell of the notebook or using the script 6_test_model.py (recommended).
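A minimal testing sketch, assuming Test_new.npz holds the vectorized test matrix produced by the sampling and cleaning steps, and that the trained model was pickled; the model's file name here is hypothetical:

```python
import pickle

from scipy.sparse import load_npz

# Load the vectorized test matrix created during sampling/cleaning.
X_test = load_npz("data/Test_new.npz")

# "model.pkl" is an assumed name; use whatever path the training step saved to.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Predict a binary tag-indicator matrix (one column per tag).
predictions = model.predict(X_test)
print(predictions.shape)
```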
I created this project for the Idea Submission round at TCS HumAIn 2019. The dataset and problem description were provided by the TCS team, so a big thanks to everyone at TCS.
- Data Exploration
- Data Cleaning and Data Engineering Part-1
- Data Engineering Part-2
- Tokenize + Remove Stop words + Stemming + Vectorizing
- Training
- Testing
- Check for all NaN values in data.
- Plot tags frequency.
- Check how tags are distributed among questions.
- Draw conclusions and strategize the next steps.
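A minimal exploration sketch in pandas covering the steps above, assuming Train.csv fits in memory (chunked reading may be needed for the full 2.2 GB file):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/Train.csv")

# Check for NaN values in each column.
print(df.isna().sum())

# Tags are space-separated strings; split and count frequencies.
tag_counts = df["Tags"].dropna().str.split().explode().value_counts()
tag_counts.head(50).plot(kind="bar", figsize=(12, 4),
                         title="Top 50 tag frequencies")
plt.show()

# Distribution of number of tags per question.
df["Tags"].dropna().str.split().str.len().value_counts().sort_index().plot(kind="bar")
plt.show()
```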
- Remove unnecessary features from data [Id].
- Drop all rows where Tags column is NaN.
- Select Most Frequent 500 tags.
- Create a list of top 500 tags.
- Find indices of examples whose tags are all a subset of the top 500 tags.
- Sample 500,000 indices from list of indices obtained from previous step.
- Sample the training set using those indices and save.
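A condensed pandas sketch of the sampling steps above; the output file name and random seed are assumptions:

```python
import pandas as pd

df = pd.read_csv("data/Train.csv").dropna(subset=["Tags"])
df = df.drop(columns=["Id"])  # Id carries no predictive signal

# The 500 most frequent tags.
top_tags = set(df["Tags"].str.split().explode().value_counts().head(500).index)

# Keep only questions whose tags all fall within the top 500.
mask = df["Tags"].str.split().apply(lambda tags: set(tags) <= top_tags)
eligible = df[mask]

# Sample 500,000 such questions and save.
sampled = eligible.sample(n=500_000, random_state=42)
sampled.to_csv("data/Train_sampled.csv", index=False)
```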
- Using regular expressions, clean all the titles in the Title column.
- Separate the code portion from the Body and put it into a Code column.
- Clean the Body column using regular expressions.
- Similarly, clean the Code column.
- Create a new DataFrame by joining the Title, Body, and Code columns, separated by spaces, into a single column.
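A minimal cleaning sketch, assuming the Body is HTML containing <code> blocks (as in the Kaggle dump); the exact regular expressions here are illustrative, not the repository's:

```python
import re

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
HTML_RE = re.compile(r"<[^>]+>")             # strip remaining HTML tags
JUNK_RE = re.compile(r"[^a-zA-Z0-9#+.\s]")   # keep chars from tags like c#, c++

def normalize(s):
    return JUNK_RE.sub(" ", s).lower().strip()

def clean_row(title, body):
    # Pull <code> blocks out of the HTML body into their own field.
    code = " ".join(CODE_RE.findall(body))
    body = HTML_RE.sub(" ", CODE_RE.sub(" ", body))
    return normalize(title), normalize(body), normalize(code)

title, body, code = clean_row("How to parse JSON?",
                              "<p>I tried</p><code>json.loads(s)</code>")
text = " ".join([title, body, code])  # single combined text column
```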
- Loop through all the 500,000 examples:
  - Tokenize the text.
  - Remove stop words from it.
  - Stem the remaining words.
  - Join the words again to form a single string.
- Save the new, modified dataset.
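A sketch of this preprocessing loop with NLTK; the choice of SnowballStemmer is an assumption (the repository may use a different stemmer):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def preprocess(text):
    # Tokenize, drop stop words, stem, and re-join into one string.
    tokens = word_tokenize(text.lower())
    kept = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(kept)

print(preprocess("Parsing JSON strings in Python is straightforward"))
```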
- Apply a binary Count Vectorizer on the Tags.
- Apply a Count Vectorizer on the Text.
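A sketch of both vectorizers, assuming df holds the cleaned data with Tags and Text columns (the column names and vectorizer parameters are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Binary vectorizer for tags: each question becomes a 0/1 row over 500 tags.
tag_vectorizer = CountVectorizer(binary=True, tokenizer=str.split,
                                 max_features=500)
y = tag_vectorizer.fit_transform(df["Tags"])

# Count vectorizer for the combined question text; parameters illustrative.
text_vectorizer = CountVectorizer(max_features=200_000, ngram_range=(1, 2))
X = text_vectorizer.fit_transform(df["Text"])
```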
- Train a Stochastic Gradient Descent model.
- Train a Support Vector Classifier.
- Train a Logistic Regression classifier.
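A minimal training sketch, taking X and y from the vectorization step above; wrapping each model in OneVsRestClassifier (one binary classifier per tag, since tag prediction is multi-label) and using LinearSVC as the SVC variant are assumptions, not necessarily the repository's exact setup:

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Hold out a validation split from the vectorized data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# One-vs-rest: train an independent binary classifier per tag.
models = {
    "sgd": OneVsRestClassifier(SGDClassifier(loss="hinge"), n_jobs=-1),
    "svc": OneVsRestClassifier(LinearSVC(), n_jobs=-1),
    "logreg": OneVsRestClassifier(LogisticRegression(max_iter=1000), n_jobs=-1),
}

for name, model in models.items():
    model.fit(X_train, y_train)
```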
- Test the different models on held-out data.
- Select the best-performing model.
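Continuing the sketch above, model selection might compare micro-averaged F1 scores on the validation split; the metric choice is an assumption, not confirmed by the repository:

```python
from sklearn.metrics import f1_score

# Micro-averaged F1 is a common metric for multi-label tag prediction.
scores = {}
for name, model in models.items():
    scores[name] = f1_score(y_val, model.predict(X_val), average="micro")

best = max(scores, key=scores.get)
print(scores, "-> best model:", best)
```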