This is the 3rd place solution to the ACM International Conference on Web Search and Data Mining (WSDM) Cup 2019, a challenge on fake news detection and sentence-pair modeling.
- Clone this project.
- Download the dataset from the corresponding competition on Kaggle and extract it under the directory `zake7749/data/dataset`:

  ```
  |-- dataset
      |-- sample_submission.csv
      |-- test.csv
      `-- train.csv
  ```
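To sanity-check the download, a quick look with pandas is enough. A minimal sketch, assuming the standard Kaggle column names (`title1_zh`, `title2_zh`, `label`); adjust if your copy differs:

```python
import pandas as pd

# Quick sanity check of the extracted dataset.
train = pd.read_csv("zake7749/data/dataset/train.csv")
test = pd.read_csv("zake7749/data/dataset/test.csv")

print(train.shape, test.shape)
print(train["label"].value_counts())  # agreed / disagreed / unrelated
```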
- Prepare the embedding models. We use 2 open-source pretrained word embeddings in this competition:
  - Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
  - Chinese-Word-Vectors
    - We select the SGNS version on the word and n-gram level, trained with the mixed-large corpus.

  Put these two embeddings under the folder `zake7749/data/wordvec/`:

  ```
  |-- wordvec
      |-- Tencent_AILab_ChineseEmbedding.txt
      `-- sgns.merge.bigram
  ```
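Both files ship in plain-text word2vec format, so they can be smoke-tested with gensim. A minimal sketch (the `limit` argument avoids loading the multi-gigabyte Tencent vocabulary in full):

```python
from gensim.models import KeyedVectors

# Load only the first 100k vectors of each file as a smoke test.
tencent = KeyedVectors.load_word2vec_format(
    "zake7749/data/wordvec/Tencent_AILab_ChineseEmbedding.txt",
    binary=False, limit=100_000)
sgns = KeyedVectors.load_word2vec_format(
    "zake7749/data/wordvec/sgns.merge.bigram",
    binary=False, limit=100_000)

print(tencent.vector_size, sgns.vector_size)
```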
The notebooks are under the folder `zake7749/code`.

- Execute `Stage 1.1. Preprocessing-on-word-level.ipynb`.
- Execute `Stage 1.2. Preprocessing-on-char-level.ipynb`.
These notebooks generate 8 cleaned datasets under `zake7749/data/processed_dataset`:

```
.
|-- engineered_chars_test.csv
|-- engineered_chars_train.csv
|-- engineered_words_test.csv
|-- engineered_words_train.csv
|-- processed_chars_test.csv
|-- processed_chars_train.csv
|-- processed_words_test.csv
`-- processed_words_train.csv
```
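The word-level and char-level pipelines differ mainly in tokenization. A rough sketch of the two views (assuming `jieba` for word segmentation; the notebooks' actual cleaning steps are more involved):

```python
import jieba

title = "飞机在太平洋上空失联"  # hypothetical example title

# Word-level view: segment the title into words.
words = list(jieba.cut(title))

# Char-level view: treat every character as its own token.
chars = list(title)

print(" ".join(words))
print(" ".join(chars))
```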
- Execute `Stage 1.3. Train-char-embeddings`, which outputs 3 char embeddings under `zake7749/data/wordvec/`:

  ```
  |-- wordvec
      |-- Tencent_AILab_ChineseEmbedding.txt
      |-- fasttext-50-win3.vec
      |-- sgns.merge.bigram
      |-- zh-wordvec-50-cbow-windowsize50.vec
      `-- zh-wordvec-50-skipgram-windowsize7.vec
  ```
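For reference, a gensim sketch that would train char embeddings with the hyper-parameters the output file names suggest (50 dimensions; windows of 3, 50, and 7). The corpus and exact settings in the notebook may differ:

```python
from gensim.models import FastText, Word2Vec

# `char_corpus` stands in for the char-tokenized titles from Stage 1.2.
char_corpus = [list("飞机在太平洋上空失联"), list("假新闻检测")]

fasttext = FastText(char_corpus, vector_size=50, window=3, min_count=1)
cbow = Word2Vec(char_corpus, vector_size=50, window=50, sg=0, min_count=1)
skipgram = Word2Vec(char_corpus, vector_size=50, window=7, sg=1, min_count=1)

skipgram.wv.save_word2vec_format("zh-wordvec-50-skipgram-windowsize7.vec")
```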
- Execute `Stage 2. First-Level-with-char-level.ipynb`.
- Execute `Stage 2. First-Level-with-word-level.ipynb`.
- Execute `Stage 3.1. First-level-ensemble-ridge-regression` (the stacking pattern is sketched below).
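The first-level ensembles stack the out-of-fold (OOF) probabilities of the base models. A minimal sketch of the ridge-regression variant with random stand-ins for the real OOF features, fitting one one-vs-rest regressor per class (an assumed scheme, not the notebook's exact one):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
oof_preds = rng.random((1000, 12))   # stand-in: n_samples x (n_models * 3)
y = rng.integers(0, 3, size=1000)    # 0=agreed, 1=disagreed, 2=unrelated

# One ridge regressor per class, fitted one-vs-rest on the OOF features;
# its cross-validated outputs become second-level features.
stacked = np.column_stack([
    cross_val_predict(Ridge(alpha=1.0), oof_preds,
                      (y == k).astype(float), cv=5)
    for k in range(3)
])
print(stacked.shape)  # (1000, 3)
```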
- Execute `Stage 3.2. First-level-ensemble-with-LGBM-each-side`.
- Execute `Stage 3.3. First-level-ensemble-with-LGBM` (see the LightGBM sketch below).
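The LightGBM stackers follow the same pattern with a gradient-boosted meta-model. A minimal sketch, again with stand-in features:

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
oof_preds = rng.random((1000, 12))   # stand-in first-level OOF features
y = rng.integers(0, 3, size=1000)

# Multiclass LightGBM meta-model; in the real pipeline it is trained
# fold-wise so that its own predictions stay out-of-fold.
clf = lgb.LGBMClassifier(objective="multiclass",
                         n_estimators=200, learning_rate=0.05)
clf.fit(oof_preds, y)
probs = clf.predict_proba(oof_preds)
print(probs.shape)  # (1000, 3)
```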
- Execute `Stage 3.4. First-level-ensemble-with-NN`.
- Execute `Stage 3.5. Second-level-ensemble`.
- Run the script `hanshan/bert/train_wsdm.sh`.
- To get a predictions file to submit at this stage, run `zake7749/bert/data/probs_to_preds.py` (a conversion sketch follows).
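The conversion itself is just an arg-max over the class probabilities, mapped back to string labels. A hypothetical sketch (the probability file's columns and the submission header are assumptions, not the script's exact I/O):

```python
import numpy as np
import pandas as pd

LABELS = ["agreed", "disagreed", "unrelated"]

probs = pd.read_csv("probs.csv")     # hypothetical probability file
test = pd.read_csv("zake7749/data/dataset/test.csv")

# Pick the arg-max class per row and map it back to the string labels.
pred_idx = np.argmax(probs[LABELS].to_numpy(), axis=1)
submission = pd.DataFrame({
    "Id": test["id"],
    "Category": [LABELS[i] for i in pred_idx],
})
submission.to_csv("submission.csv", index=False)
```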
- Execute `Stage 3.6. Bagging-with-BERT`.

  **Note: please change the path of `sec_stacking_df` to the corresponding file.**
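Bagging here means blending the BERT probabilities with the second-level stacking output. A minimal sketch of an equal-weight blend with hypothetical file names (`sec_stacking_df` is the variable the note above refers to; the notebook may use different weights):

```python
import pandas as pd

LABELS = ["agreed", "disagreed", "unrelated"]

bert_df = pd.read_csv("bert_probs.csv")            # hypothetical path
sec_stacking_df = pd.read_csv("second_level.csv")  # point this at your file

# Equal-weight average of the two probability tables.
blend = 0.5 * bert_df[LABELS] + 0.5 * sec_stacking_df[LABELS]
```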
- Execute `Stage 4.1. Fine-tune-word-level-models.ipynb`.
- Execute `Stage 4.2. Fine-tune-char-level-models.ipynb`.
- Run `hanshan/prep_pseudo_labels.py` (the pseudo-labeling idea is sketched below).
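Pseudo-labeling takes the test rows the ensemble is most confident about and appends them, with their predicted labels, to the training data before BERT is re-trained. A rough sketch with an assumed confidence threshold and a hypothetical ensemble-output file:

```python
import pandas as pd

LABELS = ["agreed", "disagreed", "unrelated"]

test = pd.read_csv("zake7749/data/dataset/test.csv")
probs = pd.read_csv("ensemble_probs.csv")   # hypothetical ensemble output

# Keep only rows whose top class probability clears a threshold
# (0.9 is an assumption, not the script's actual value).
confident = probs[LABELS].max(axis=1) > 0.9
pseudo = test.loc[confident].copy()
pseudo["label"] = probs.loc[confident, LABELS].idxmax(axis=1)

train = pd.read_csv("zake7749/data/dataset/train.csv")
augmented = pd.concat([train, pseudo], ignore_index=True, sort=False)
```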
- Run the script `hanshan/bert/train_wsdm_pl.sh`.
- Execute `Stage 5.1. First-level-fine-tuned-ensemble-ridge-regression.ipynb`.
- Execute `Stage 5.2. First-level-fine-tuned-ensemble-withNN.ipynb`.
- Execute `Stage 5.3. First-level-fine-tuned-ensemble-with-LGBM.ipynb`.
- Execute `Stage 5.4. Second-level-fine-tuned-ensemble.ipynb`.
- Execute `Stage 9. High-Ground.ipynb`.
- Execute `Stage 42. Final Answer.ipynb`.
The final prediction `final_answer.csv` will be generated under the folder `zake7749/data/high_ground/`.