This is the 3rd place solution to the ACM International Conference on Web Search and Data Mining (WSDM) Cup 2019, a challenge on fake news detection and sentence-pair modeling.
- Clone this project.
- Download the dataset from the corresponding competition on Kaggle and extract it under the directory `zake7749/data/dataset`:

  ```
  |-- dataset
      |-- sample_submission.csv
      |-- test.csv
      `-- train.csv
  ```
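To sanity-check the download, a quick look with pandas is enough. A minimal sketch, assuming the standard Kaggle column names (`title1_zh`, `title2_zh`, `label`); adjust if your copy differs:

```python
import pandas as pd

# Quick sanity check of the extracted dataset.
train = pd.read_csv("zake7749/data/dataset/train.csv")
test = pd.read_csv("zake7749/data/dataset/test.csv")

print(train.shape, test.shape)
print(train["label"].value_counts())  # agreed / disagreed / unrelated
```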
- Prepare the embedding models. We use 2 open-source pretrained word embeddings in this competition:
  - Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
  - Chinese-Word-Vectors
    - We select the SGNS version on the word and n-gram level, trained with the mixed-large corpus.

  Put these two embeddings under the folder `zake7749/data/wordvec/`:

  ```
  |-- wordvec
      |-- Tencent_AILab_ChineseEmbedding.txt
      `-- sgns.merge.bigram
  ```
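Both files ship in plain-text word2vec format, so they can be smoke-tested with gensim. A minimal sketch (the `limit` argument avoids loading the multi-gigabyte Tencent vocabulary in full):

```python
from gensim.models import KeyedVectors

# Load only the first 100k vectors of each file as a smoke test.
tencent = KeyedVectors.load_word2vec_format(
    "zake7749/data/wordvec/Tencent_AILab_ChineseEmbedding.txt",
    binary=False, limit=100_000)
sgns = KeyedVectors.load_word2vec_format(
    "zake7749/data/wordvec/sgns.merge.bigram",
    binary=False, limit=100_000)

print(tencent.vector_size, sgns.vector_size)
```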
The notebooks are under the folder `zake7749/code`.

- Execute `Stage 1.1. Preprocessing-on-word-level.ipynb`.
- Execute `Stage 1.2. Preprocessing-on-char-level.ipynb`.
These notebooks generate 8 cleaned datasets under `zake7749/data/processed_dataset`:

```
.
|-- engineered_chars_test.csv
|-- engineered_chars_train.csv
|-- engineered_words_test.csv
|-- engineered_words_train.csv
|-- processed_chars_test.csv
|-- processed_chars_train.csv
|-- processed_words_test.csv
`-- processed_words_train.csv
```
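The word-level and char-level pipelines differ mainly in tokenization. A rough sketch of the two views (assuming `jieba` for word segmentation; the notebooks' actual cleaning steps are more involved):

```python
import jieba

title = "飞机在太平洋上空失联"  # hypothetical example title

# Word-level view: segment the title into words.
words = list(jieba.cut(title))

# Char-level view: treat every character as its own token.
chars = list(title)

print(" ".join(words))
print(" ".join(chars))
```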
- Execute `Stage 1.3. Train-char-embeddings`, which outputs 3 char embeddings under `zake7749/data/wordvec/`:

  ```
  |-- wordvec
      |-- Tencent_AILab_ChineseEmbedding.txt
      |-- fasttext-50-win3.vec
      |-- sgns.merge.bigram
      |-- zh-wordvec-50-cbow-windowsize50.vec
      `-- zh-wordvec-50-skipgram-windowsize7.vec
  ```
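For reference, a gensim sketch that would train char embeddings with the hyper-parameters the output file names suggest (50 dimensions; windows of 3, 50, and 7). The corpus and exact settings in the notebook may differ:

```python
from gensim.models import FastText, Word2Vec

# `char_corpus` stands in for the char-tokenized titles from Stage 1.2.
char_corpus = [list("飞机在太平洋上空失联"), list("假新闻检测")]

fasttext = FastText(char_corpus, vector_size=50, window=3, min_count=1)
cbow = Word2Vec(char_corpus, vector_size=50, window=50, sg=0, min_count=1)
skipgram = Word2Vec(char_corpus, vector_size=50, window=7, sg=1, min_count=1)

skipgram.wv.save_word2vec_format("zh-wordvec-50-skipgram-windowsize7.vec")
```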
- Execute `Stage 2. First-Level-with-char-level.ipynb`.
- Execute `Stage 2. First-Level-with-word-level.ipynb`.
- Execute `Stage 3.1. First-level-ensemble-ridge-regression` (the stacking pattern is sketched below).
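The first-level ensembles stack the out-of-fold (OOF) probabilities of the base models. A minimal sketch of the ridge-regression variant with random stand-ins for the real OOF features, fitting one one-vs-rest regressor per class (an assumed scheme, not the notebook's exact one):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
oof_preds = rng.random((1000, 12))   # stand-in: n_samples x (n_models * 3)
y = rng.integers(0, 3, size=1000)    # 0=agreed, 1=disagreed, 2=unrelated

# One ridge regressor per class, fitted one-vs-rest on the OOF features;
# its cross-validated outputs become second-level features.
stacked = np.column_stack([
    cross_val_predict(Ridge(alpha=1.0), oof_preds,
                      (y == k).astype(float), cv=5)
    for k in range(3)
])
print(stacked.shape)  # (1000, 3)
```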
- Execute `Stage 3.2. First-level-ensemble-with-LGBM-each-side`.
- Execute `Stage 3.3. First-level-ensemble-with-LGBM` (see the LightGBM sketch below).
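The LightGBM stackers follow the same pattern with a gradient-boosted meta-model. A minimal sketch, again with stand-in features:

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
oof_preds = rng.random((1000, 12))   # stand-in first-level OOF features
y = rng.integers(0, 3, size=1000)

# Multiclass LightGBM meta-model; in the real pipeline it is trained
# fold-wise so that its own predictions stay out-of-fold.
clf = lgb.LGBMClassifier(objective="multiclass",
                         n_estimators=200, learning_rate=0.05)
clf.fit(oof_preds, y)
probs = clf.predict_proba(oof_preds)
print(probs.shape)  # (1000, 3)
```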
- Execute `Stage 3.4. First-level-ensemble-with-NN`.
- Execute `Stage 3.5. Second-level-ensemble`.
- Run the script `hanshan/bert/train_wsdm.sh`.
- To get a predictions file to submit at this stage, run `zake7749/bert/data/probs_to_preds.py` (a conversion sketch follows).
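The conversion itself is just an arg-max over the class probabilities, mapped back to string labels. A hypothetical sketch (the probability file's columns and the submission header are assumptions, not the script's exact I/O):

```python
import numpy as np
import pandas as pd

LABELS = ["agreed", "disagreed", "unrelated"]

probs = pd.read_csv("probs.csv")     # hypothetical probability file
test = pd.read_csv("zake7749/data/dataset/test.csv")

# Pick the arg-max class per row and map it back to the string labels.
pred_idx = np.argmax(probs[LABELS].to_numpy(), axis=1)
submission = pd.DataFrame({
    "Id": test["id"],
    "Category": [LABELS[i] for i in pred_idx],
})
submission.to_csv("submission.csv", index=False)
```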
- Execute `Stage 3.6. Bagging-with-BERT`.

  **Note: please change the path of `sec_stacking_df` to the corresponding file.**
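Bagging here means blending the BERT probabilities with the second-level stacking output. A minimal sketch of an equal-weight blend with hypothetical file names (`sec_stacking_df` is the variable the note above refers to; the notebook may use different weights):

```python
import pandas as pd

LABELS = ["agreed", "disagreed", "unrelated"]

bert_df = pd.read_csv("bert_probs.csv")            # hypothetical path
sec_stacking_df = pd.read_csv("second_level.csv")  # point this at your file

# Equal-weight average of the two probability tables.
blend = 0.5 * bert_df[LABELS] + 0.5 * sec_stacking_df[LABELS]
```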
- Execute `Stage 4.1. Fine-tune-word-level-models.ipynb`.
- Execute `Stage 4.2. Fine-tune-char-level-models.ipynb`.
- Run `hanshan/prep_pseudo_labels.py` (the pseudo-labeling idea is sketched below).
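Pseudo-labeling takes the test rows the ensemble is most confident about and appends them, with their predicted labels, to the training data before BERT is re-trained. A rough sketch with an assumed confidence threshold and a hypothetical ensemble-output file:

```python
import pandas as pd

LABELS = ["agreed", "disagreed", "unrelated"]

test = pd.read_csv("zake7749/data/dataset/test.csv")
probs = pd.read_csv("ensemble_probs.csv")   # hypothetical ensemble output

# Keep only rows whose top class probability clears a threshold
# (0.9 is an assumption, not the script's actual value).
confident = probs[LABELS].max(axis=1) > 0.9
pseudo = test.loc[confident].copy()
pseudo["label"] = probs.loc[confident, LABELS].idxmax(axis=1)

train = pd.read_csv("zake7749/data/dataset/train.csv")
augmented = pd.concat([train, pseudo], ignore_index=True, sort=False)
```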
- Run the script `hanshan/bert/train_wsdm_pl.sh`.
- Execute `Stage 5.1. First-level-fine-tuned-ensemble-ridge-regression.ipynb`.
- Execute `Stage 5.2. First-level-fine-tuned-ensemble-withNN.ipynb`.
- Execute `Stage 5.3. First-level-fine-tuned-ensemble-with-LGBM.ipynb`.
- Execute `Stage 5.4. Second-level-fine-tuned-ensemble.ipynb`.
- Execute `Stage 9. High-Ground.ipynb`.
- Execute `Stage 42. Final Answer.ipynb`.
The final prediction `final_answer.csv` will be generated under the folder `zake7749/data/high_ground/`.