Notebooks, results and datasets for the android and superuser datasets. All results, unless explicitly mentioned, are based on the android dataset.
Contains all the baseline notebooks, with saved outputs trained on the android dataset and their training traces (model loss, accuracy on train) (related+linked)
Note: the Evaluator sections of the code in these notebooks contain some bugs; they are forks of the sentence-transformers evaluators
- BERT_Only.ipynb (unimodal): BERT bi-encoder, passing title+tag+body as the input
- BERT_Only+Tags (unimodal): (BERT + BiLSTM) bi-encoder, passing title+body to BERT and the tags into a BiLSTM
- BERT_BERT.ipynb (unimodal): (BERT + BERT) bi-encoder, passing title+body to one BERT and the tags to another BERT
- RESNET+BERT (multimodal): ResNet + BERT, passing title+tag+body as the input to BERT and the images to ResNet, then concatenating the two representations
- Image Pairs Related.docx: Contains stats on the number of image and non-image pairs in train, dev and test
- USE_Baseline: Contains the baseline using the Universal Sentence Encoder
- BERT_from_Scratch: Contains a raw BERT model (not fine-tuned) trained on all the StackExchange datasets combined into a large corpus of around 140 MB. Results are not good because there is far too little data
- hugging_face: Official Hugging Face notebook on how to train BERT; found after writing BERT_from_scratch.ipynb
- Multimodal_final.ipynb: Final multimodal notebook; implements the model from the Xinyu paper, with optimized dataset splitting and exclusivity across dev, test and train
- Multimodal_1.ipynb : Multimodal notebook with simple dataset splitting
- android_matching_data.ipynb: Combines the title, tag and body into one string
- Linked_data.ipynb: Notebook created to retrieve the linked field using the StackExchange API
- Final_Data_Refining.ipynb: Fills in the missing or empty data fields and refines the data
- Related.ipynb: Retrieves all the related fields and the corresponding missing questions without images
- Text_Corpus.ipynb: Combines all the datasets to form a large corpus
- Text_Preprocessing.ipynb: Preprocesses title and body by removing stop words, expanding contractions and removing HTML tags (standard preprocessing)
- Title: Fixes the missing titles in the dataset
- Web_Crawler: A Python script to crawl StackExchange sites. Not useful in the end: the crawler was eventually blocked, so the only viable route is the StackExchange API
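The standard preprocessing applied in Text_Preprocessing.ipynb (HTML-tag removal, contraction expansion, stop-word removal) can be sketched roughly as below. The mini stop-word and contraction lists here are illustrative stand-ins, not the ones actually used in the notebook:

```python
import re

# Illustrative mini-lists; the real notebook presumably uses fuller sets
# (e.g. NLTK stop words and a larger contraction map).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is",
                "doesn't": "does not", "i'm": "i am"}

def strip_html(text: str) -> str:
    """Remove HTML tags such as <p> and <code> from a question body."""
    return re.sub(r"<[^>]+>", " ", text)

def expand_contractions(text: str) -> str:
    """Expand common English contractions word by word."""
    return " ".join(CONTRACTIONS.get(w.lower(), w) for w in text.split())

def preprocess(text: str) -> str:
    """Standard cleanup: strip HTML, expand contractions, drop stop words."""
    text = strip_html(text)
    text = expand_contractions(text)
    tokens = [w for w in re.findall(r"[a-z0-9']+", text.lower())
              if w not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("<p>It's impossible to root the phone</p>"))
# -> it impossible root phone
```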
Contains models and notebooks implemented with the sbert library in a bi-encoder fashion. The loss and accuracy functions are inspired by the sbert implementation for the Quora duplicate-pairs dataset (check the paths to the dataset before running)
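The bi-encoder setup these notebooks share can be sketched as follows: each question is embedded independently and the pair is scored by cosine similarity. The `encode` argument stands in for a real model's encode method (e.g. a sentence-transformers model's `encode`); `toy_encode` is only a dummy so the sketch runs without downloading a model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score_pair(encode, q1: str, q2: str) -> float:
    """Bi-encoder scoring: embed each question independently, then
    compare the two embeddings. With the real notebooks, `encode`
    would be the trained model's encode method."""
    return cosine(encode(q1), encode(q2))

def toy_encode(text: str):
    """Toy bag-of-letters encoder, NOT a meaningful embedding."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

print(score_pair(toy_encode, "rooting phone", "root my phone"))
```

The same `score_pair` pattern works unchanged for duplicate ranking: embed the corpus once, then score a query against every candidate.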
- simple_sbert: Contains a simpler version of the sbert model
- sbert_linked: sbert on the linked dataset
- sbert_scenario2(incomplete).ipynb: Implements scenario 2 of Augmented SBERT, in which the given dataset is scored with a cross-encoder; here the cross-encoder was trained on the Quora dataset provided by sbert. The scores were not good, so the approach was rejected
- sbert_scenario2_official.ipynb: Official scenario-2 example provided by sbert
- Test.ipynb: Notebook for detailed testing of the model and how it works
- Superuser: Contains the latest and final version of the BERT unimodal and multimodal models, with different ResNet and BERT layers frozen. Also contains the finalized preprocessing and pair-formation scripts. All results and data paths are modified for the server and are based on the superuser dataset. Warning: 'OCR' is not implemented completely. Server also contains the finalized. Note that running 10 epochs on the server takes around 5-6 days. The same folder is present (with data) in /home/ckm/visualqatickets/superuser; some data is also present on Sir's PC in 'E:/visualqatickets'
- final_model.ipynb: Contains the sbert implementation, taking as input the pairs from all the StackExchange datasets combined
- all_data_softmax_3epoch_model: Contains a saved sentence-transformers model (using the sbert lib) trained on all the available datasets combined (related only)
- Linked_android_model(cross_encoder): Contains the results of training a cross-encoder (sentence-transformers) on the linked dataset
- Linked: Contains all the baselines evaluated separately on the linked dataset
- MultiModal: Contains the multimodal notebooks, the data splits for android, and the saved models and traces of the different experiments
- quora_android_2: Contains the sbert models based on BERT and DistilBERT with different margins, losses and accuracy functions
- quora_android(bad_data_splitting): Contains a buggy implementation of data splitting; may contain some useful functions
- quora_apple: Contains the notebooks from the sbert folder, implemented on the apple dataset
- preproccesed_data: Contains all the pre-processed data; nosw = no stop words, l = linked
- SemEval 2017 task 3 dataset: Contains research on community-based QA forums
- Shah(2018): SOTA model (before BERT) on android, superuser etc.
- USE_Baseline.ipynb: Baseline constructed using the Universal Sentence Encoder
- MultiModal Embeddings: Contains the corresponding research papers, raw data and the Colab files used for preprocessing
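Several of the folders above differ mainly in how the question pairs were split (Multimodal_final.ipynb enforces exclusivity across dev, test and train, while quora_android has a buggy split). A minimal sketch of an exclusivity-preserving split is shown below; the function name and pair layout are assumptions, not taken from the notebooks:

```python
import random

def exclusive_split(pairs, dev_frac=0.1, test_frac=0.1, seed=42):
    """Split (q1_id, q2_id, label) pairs so that no question id appears
    in more than one of train/dev/test. Question ids are partitioned
    first; a pair is kept in a split only if BOTH its questions landed
    there, so pairs straddling two splits are dropped rather than leaked."""
    qids = sorted({q for p in pairs for q in (p[0], p[1])})
    rng = random.Random(seed)
    rng.shuffle(qids)
    n = len(qids)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test_q = set(qids[:n_test])
    dev_q = set(qids[n_test:n_test + n_dev])
    train_q = set(qids[n_test + n_dev:])

    def take(qset):
        return [p for p in pairs if p[0] in qset and p[1] in qset]

    return take(train_q), take(dev_q), take(test_q)
```

Dropping straddling pairs shrinks the dataset slightly, but it is what guarantees that no question seen in training ever reappears in dev or test.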