[DASFAA2023] The source codes and datasets for paper: Adversarial Learning-based Stance Classifier for COVID-19-related Health Policies
For embedding-based methods, we apply GloVe to initialize the word embedding layer, while for BERT-based methods, we directly use the AutoTokenizer for word tokenization. The dataset preprocessing files are:
- src/datasets_glove.py
- src/datasets.py
For GloVe download:
cd data
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
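As a rough illustration (not the exact code in src/datasets_glove.py), the unzipped vectors can be loaded into a trainable embedding layer as sketched below; the `vocab` mapping, the 300-dimensional file name, and the random fallback for out-of-vocabulary words are assumptions for this example:
```python
# Sketch only: initialize a word embedding layer from downloaded GloVe vectors.
import numpy as np
import torch
import torch.nn as nn

def build_glove_embedding(vocab, glove_path="data/glove.6B.300d.txt", dim=300):
    # Start from small random vectors so out-of-vocabulary words remain trainable.
    weights = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                weights[vocab[word]] = np.asarray(values, dtype=np.float32)
    # freeze=False keeps the embeddings trainable during fine-tuning.
    return nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)
```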
For BERT-based models:
- Install Hugging Face Transformers:
conda install -c huggingface transformers
pip install emoji
pip install -r requirements.txt
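Once Transformers and emoji are installed, the BERT-based tokenization in src/datasets.py can be pictured as in the sketch below; the checkpoint name, the emoji handling step, and the max_length value are illustrative assumptions, not necessarily the repo's exact code:
```python
# Illustrative only: tokenizing a tweet for a BERT-based backbone.
from transformers import AutoTokenizer
import emoji

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_tweet(text, max_len=100):
    # Convert emojis to text aliases before tokenization (emoji package).
    text = emoji.demojize(text)
    return tokenizer(text, truncation=True, padding="max_length",
                     max_length=max_len, return_tensors="pt")
```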
Topic | #Unlabeled | #Labeled (Favor/Against/None) |
---|---|---|
Stay at Home (SH) | 778 | 420 (194/113/113) |
Wear Masks (WM) | 1030 | 756 (173/288/295) |
Vaccination (VA) | 1535 | 526 (106/194/226) |
To obtain the raw tweet data, please contact: xiefeng@nudt.edu.cn
For more detailed descriptions of the dataset, please click here!
- cross-target setting:
python run.py --topic vaccination,face_masks --model mymodel --batch 16 --epoch 100 --hidden 128 --p_lambda 0.1 --alpha 0.01 --backbone bert_base
- zero-shot setting:
python run.py --topic zeroshot,face_masks --model mymodel --batch 16 --epoch 100 --hidden 256 --p_lambda 0.1 --alpha 0.01 --backbone bert_base --lr 0.00002
Parameter | Description | Default | Values |
---|---|---|---|
--model | the model to run | mymodel | bilstm, bicond, textcnn, crossnet, tan, bert_base, mymodel |
--topic | the task to run | - | cross-target setting or zero-shot setting |
--batch | batch size | 16 | - |
--epoch | the number of training epochs | 100 | - |
--patience | the patience (in epochs) used for early stopping | 5 | - |
--max_len | the maximum number of tokens | 100 | 100, 150 |
--hidden | the hidden dimension of the model, if applicable | 128 | 128, 256 |
--alpha | the trade-off parameter between objectives | 0.01 | - |
--p_lambda | the negative constant in the Gradient Reversal Layer | 0.1 | - |
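The --p_lambda flag corresponds to the constant applied when reversing gradients in the Gradient Reversal Layer. A minimal PyTorch sketch of such a layer (our own illustration, not necessarily identical to the module in this repo) looks like this:
```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -p_lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, p_lambda):
        ctx.p_lambda = p_lambda
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.p_lambda * grad_output, None

def grad_reverse(x, p_lambda=0.1):
    return GradReverse.apply(x, p_lambda)
```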
All implemented methods are stored in the folder src/baselines:
- BiLSTM (1997): Long short-term memory.
- BiCond (2016): Stance Detection with Bidirectional Conditional Encoding.
- TAN (2017): Stance Classification with Target-Specific Neural Attention Networks.
- CrossNet (2018): Cross-Target Stance Classification with Self-Attention Networks.
- SiamNet (2019): Can siamese networks help in stance detection?
- Bert (2019): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- Bertweet (2020): BERTweet: A pre-trained language model for English Tweets.
- Covid-Tweet-Bert (2020): COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter.
- WS-Bert (2022): Infusing Knowledge from Wikipedia to Enhance Stance Detection.
All programs are implemented using Python 3.6.13 and PyTorch 1.10.2 with CUDA 11.3 on a personal workstation with an NVIDIA GeForce RTX 3090 GPU. The reported results are the average scores of 5 runs with different random initializations.
In the cross-target setting, the models are trained and validated on one topic and evaluated on another. These can be categorized into six source->destination tasks for cross-target evaluation: SH->WM, SH->VA, WM->SH, WM->VA, VA->SH, and VA->WM. In the zero-shot setting, the models are trained and validated on multiple topics and tested on one unseen topic. We use the unseen topic's name as the task's name; thus, the zero-shot evaluation covers three tasks: SH, WM, and VA. For all tasks, the batch size is set to 16, the dropout rate is set to 0.1, and the input texts are truncated or padded to a maximum of 100 tokens. We train all models using the AdamW optimizer with a weight decay of 5e-5 for a maximum of 100 epochs with a patience of 10 epochs, and the learning rate is chosen from {1e-5, 2e-5}.
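For reference, the optimization setup described above (AdamW with weight decay 5e-5, up to 100 epochs, early stopping on the validation score) can be sketched as below; `model`, `train_loader`, and `evaluate` are placeholders for this illustration, not code from this repo:
```python
import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=5e-5)
best_score, patience, bad_epochs = 0.0, 10, 0

for epoch in range(100):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(**batch)          # assumed to return the training loss
        loss.backward()
        optimizer.step()

    score = evaluate(model)            # e.g. macro-F1 on the validation topic
    if score > best_score:
        best_score, bad_epochs = score, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # early stopping with fixed patience
            break
```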
For BiLSTM, BiCond, TAN, CrossNet, and TextCNN, the word embeddings are initialized with the pre-trained word vectors from GloVe, and the hidden dimension is optimized in {128, 256}. For BERT, we fine-tune the pre-trained language model from the Hugging Face Transformers library to predict the stance by appending a linear classification layer to the hidden representation of the [CLS] token. For WS-BERT-S and WS-BERT-D, considering computational resources and fair comparison, the maximum length of Wikipedia summaries is set to 100 tokens, and we use the pre-trained uncased BERT-base as the encoder, in which each word is mapped to a 768-dimensional embedding. To speed up training, we only fine-tune the top layers of the Wikipedia encoder in WS-BERT-D, which is consistent with the original paper. In our model, we also adopt the pre-trained uncased BERT-base as the encoder. The maximum length of the policy description is fixed at 50, the layer number l of the GCN is set to 2, the trade-off parameter alpha is set to 0.01, the GRL's parameter lambda is set to 0.1, and the hidden dimension of the GeoEncoder is optimized in {128, 256}.
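The BERT baseline described above (a linear classification layer on the [CLS] hidden state of the pre-trained uncased BERT-base) can be sketched as follows; this is a generic illustration rather than the exact module in src/baselines, and the dropout rate and label count mirror the settings reported above:
```python
import torch.nn as nn
from transformers import AutoModel

class BertStanceClassifier(nn.Module):
    def __init__(self, num_labels=3, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        # hidden_size is 768 for BERT-base; labels: favor / against / none.
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]   # hidden representation of [CLS]
        return self.classifier(self.dropout(cls))
```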
The tweet set for each policy contained a good mixture of pro, con, and neutral categories, as well as tweets with implicit and explicit opinions about the target. We removed the hashtags that appeared at the end of a tweet to exclude obvious cues, without making the tweet syntactically ambiguous. Each tweet was annotated by three annotators to avoid subjective errors of judgment. At present, we only collect social content on Twitter, without considering other social platforms, such as Weibo.
Our dataset does not provide any personally identifiable information as only the tweet IDs and human-annotated stance labels will be shared. Thus, the dataset complies with Twitter’s information privacy policy.
We refer to the code of the following repos: WS-BERT, GVB, DANN. Thanks for their great contributions!