Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.
Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.
“Given a pair of questions, are they duplicates of each other?”
source : https://www.kaggle.com/c/quora-question-pairs/overview/description
COLUMNS
- id : unique id of the question pair
- qid1/qid2 : question id of question1/question2
- question1/question2 : the actual question text (string)
- is_duplicate : is question1 a duplicate of question2? (1: YES / 0: NO)
- freq_qid1 = Frequency of qid1 (number of times the question with qid1 occurs in the dataset)
- freq_qid2 = Frequency of qid2 (number of times the question with qid2 occurs in the dataset)
- q1len = Character length of question1
- q2len = Character length of question2
- q1_n_words = Number of words in Question 1
- q2_n_words = Number of words in Question 2
- word_Common = Number of unique words common to Question 1 and Question 2
- word_Total = Total number of words in Question 1 + total number of words in Question 2
- word_share = word_Common / word_Total (see the feature-extraction sketch after this list)
- freq_q1+freq_q2 = sum of the frequencies of qid1 and qid2
- freq_q1-freq_q2 = absolute difference of the frequencies of qid1 and qid2
- cwc_min : Ratio of common_word_count to the minimum word count of Q1 and Q2
- cwc_max : Ratio of common_word_count to the maximum word count of Q1 and Q2
- csc_min : Ratio of common_stop_count to the minimum stopword count of Q1 and Q2
- csc_max : Ratio of common_stop_count to the maximum stopword count of Q1 and Q2
- ctc_min : Ratio of common_token_count to the minimum token count of Q1 and Q2
- ctc_max : Ratio of common_token_count to the maximum token count of Q1 and Q2
- last_word_eq : Check whether the last word of both questions is the same (1) or not (0)
- first_word_eq : Check whether the first word of both questions is the same (1) or not (0)
- abs_len_diff : Absolute difference of the token counts of Q1 and Q2
- mean_len : Mean of the token counts of Q1 and Q2
- fuzz_ratio : fuzz.ratio similarity score (FuzzyWuzzy) of the two questions
- fuzz_partial_ratio : fuzz.partial_ratio similarity score (FuzzyWuzzy), based on the best matching substring
- token_sort_ratio : fuzz.token_sort_ratio similarity score (FuzzyWuzzy), computed after sorting the tokens
- token_set_ratio : fuzz.token_set_ratio similarity score (FuzzyWuzzy), computed on the token sets
- longest_substr_ratio : Ratio of the length of the longest common substring to the minimum token count of Q1 and Q2
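A minimal sketch of how the basic and advanced features listed above can be computed. It assumes pandas, NLTK stopwords, FuzzyWuzzy, and difflib; the SAFE_DIV constant and the lowercase whitespace tokenization are assumptions, not details taken from the original write-up.

```python
# Sketch of the basic and advanced features above (assumed implementation).
import difflib

import pandas as pd
from fuzzywuzzy import fuzz
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))
SAFE_DIV = 0.0001  # guard against division by zero (assumed value)


def basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the freq_*, *len, *_n_words and word_Common/Total/share columns."""
    df["freq_qid1"] = df.groupby("qid1")["qid1"].transform("count")
    df["freq_qid2"] = df.groupby("qid2")["qid2"].transform("count")
    df["q1len"] = df["question1"].fillna("").str.len()
    df["q2len"] = df["question2"].fillna("").str.len()
    df["q1_n_words"] = df["question1"].fillna("").apply(lambda q: len(q.split()))
    df["q2_n_words"] = df["question2"].fillna("").apply(lambda q: len(q.split()))

    def common_and_total(row):
        w1 = set(str(row["question1"]).lower().split())
        w2 = set(str(row["question2"]).lower().split())
        return pd.Series({"word_Common": len(w1 & w2),
                          "word_Total": len(w1) + len(w2)})

    df[["word_Common", "word_Total"]] = df.apply(common_and_total, axis=1)
    df["word_share"] = df["word_Common"] / df["word_Total"]
    df["freq_q1+q2"] = df["freq_qid1"] + df["freq_qid2"]
    df["freq_q1-q2"] = (df["freq_qid1"] - df["freq_qid2"]).abs()
    return df


def advanced_features(q1: str, q2: str) -> dict:
    """Token-based ratios and FuzzyWuzzy scores for one question pair."""
    q1, q2 = str(q1), str(q2)
    t1, t2 = q1.lower().split(), q2.lower().split()
    w1 = {t for t in t1 if t not in STOP_WORDS}  # non-stopword tokens
    w2 = {t for t in t2 if t not in STOP_WORDS}
    s1 = {t for t in t1 if t in STOP_WORDS}      # stopword tokens
    s2 = {t for t in t2 if t in STOP_WORDS}
    common_word, common_stop = len(w1 & w2), len(s1 & s2)
    common_token = len(set(t1) & set(t2))
    lcs = difflib.SequenceMatcher(None, q1, q2).find_longest_match(
        0, len(q1), 0, len(q2))
    return {
        "cwc_min": common_word / (min(len(w1), len(w2)) + SAFE_DIV),
        "cwc_max": common_word / (max(len(w1), len(w2)) + SAFE_DIV),
        "csc_min": common_stop / (min(len(s1), len(s2)) + SAFE_DIV),
        "csc_max": common_stop / (max(len(s1), len(s2)) + SAFE_DIV),
        "ctc_min": common_token / (min(len(set(t1)), len(set(t2))) + SAFE_DIV),
        "ctc_max": common_token / (max(len(set(t1)), len(set(t2))) + SAFE_DIV),
        "last_word_eq": int(t1[-1] == t2[-1]) if t1 and t2 else 0,
        "first_word_eq": int(t1[0] == t2[0]) if t1 and t2 else 0,
        "abs_len_diff": abs(len(t1) - len(t2)),
        "mean_len": (len(t1) + len(t2)) / 2,
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "token_set_ratio": fuzz.token_set_ratio(q1, q2),
        "longest_substr_ratio": lcs.size / (min(len(t1), len(t2)) + SAFE_DIV),
    }
```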
Final feature set: Basic feature extraction + Advanced feature extraction + Vector representation
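The write-up does not say which vector representation was used. As one plausible option, here is a minimal TF-IDF sketch; the max_features value and the side-by-side stacking of the two question vectors are assumptions.

```python
# One possible vector representation (TF-IDF) -- an assumption, since the
# original write-up does not name the representation it used.
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_vectors(train_df: pd.DataFrame, test_df: pd.DataFrame):
    """Fit TF-IDF on the training questions only, then transform both splits."""
    tfidf = TfidfVectorizer(lowercase=True, max_features=5000)  # assumed size
    corpus = pd.concat([train_df["question1"], train_df["question2"]]).fillna("")
    tfidf.fit(corpus)

    def transform(df):
        q1 = tfidf.transform(df["question1"].fillna(""))
        q2 = tfidf.transform(df["question2"].fillna(""))
        return hstack([q1, q2]).tocsr()  # question vectors side by side

    return transform(train_df), transform(test_df)
```

The dense hand-crafted features from the list above would typically be concatenated with these vectors before training.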
Data split : 70/30 (70% training and 30% testing)
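A short sketch of the 70/30 split with scikit-learn; X and y are assumed names for the combined feature matrix and the is_duplicate labels, and the random_state and stratify settings are assumptions for reproducibility and class balance.

```python
# 70/30 train/test split as stated above.
from sklearn.model_selection import train_test_split

# X: combined feature matrix, y: the is_duplicate labels (assumed names).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```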
Hyperparameter tuning, run 1 (log loss for each alpha):
- alpha = 1e-05 : log loss = 0.592800211149
- alpha = 0.0001 : log loss = 0.532351700629
- alpha = 0.001 : log loss = 0.527562275995
- alpha = 0.01 : log loss = 0.534535408885
- alpha = 0.1 : log loss = 0.525117052926
- alpha = 1 : log loss = 0.520035530431
- alpha = 10 : log loss = 0.521097925307

Hyperparameter tuning, run 2 (log loss for each alpha):
- alpha = 1e-05 : log loss = 0.657611721261
- alpha = 0.0001 : log loss = 0.489669093534
- alpha = 0.001 : log loss = 0.521829068562
- alpha = 0.01 : log loss = 0.566295616914
- alpha = 0.1 : log loss = 0.599957866217
- alpha = 1 : log loss = 0.635059427016
- alpha = 10 : log loss = 0.654159467907
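The two result blocks above are not labelled with the models that produced them. The sketch below shows one common way such alpha sweeps are generated for this dataset: an SGD-trained linear model per alpha, with probabilities calibrated so that log loss is well defined. The choice of SGDClassifier, the 'log_loss'/'hinge' losses, and the sigmoid calibration are assumptions, not details from the write-up.

```python
# Hedged sketch of an alpha sweep like the ones above (assumed setup).
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss


def alpha_sweep(X_train, y_train, X_test, y_test, loss="log_loss"):
    """Print test log loss for a range of regularization strengths.

    loss="log_loss" gives logistic regression, loss="hinge" a linear SVM
    (use loss="log" on scikit-learn < 1.1).
    """
    for alpha in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10]:
        clf = SGDClassifier(loss=loss, alpha=alpha, penalty="l2",
                            random_state=42)
        # Calibration provides predict_proba even for hinge loss, so the
        # log loss metric is well defined for both model types.
        calibrated = CalibratedClassifierCV(clf, method="sigmoid")
        calibrated.fit(X_train, y_train)
        probs = calibrated.predict_proba(X_test)
        print("For values of alpha =", alpha, "The log loss is:",
              log_loss(y_test, probs))
```

Running the sweep once per loss function would produce two result blocks like those above; which models actually produced the reported numbers is not stated in the write-up.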