wget http://www.da.inf.ethz.ch/teaching/2018/CIL/material/exercise/twitter-datasets.zip
python3 infrastructure/convert.py
prepares the train and test datasets by merging the positive and negative tweets with the appropriate labels (1: positive tweets, 0: negative tweets).
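As a rough illustration of what this step does, here is a minimal sketch of merging the two raw tweet files into one labeled CSV. The file names, output path, and column names are assumptions for illustration, not convert.py's actual interface.

import csv

# Hypothetical paths; convert.py may use different file names and output locations.
files = [("data/train_pos.txt", 1), ("data/train_neg.txt", 0)]

with open("data/train.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["text", "label"])
    for path, label in files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                tweet = line.strip()
                if tweet:
                    writer.writerow([tweet, label])  # 1 = positive, 0 = negative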
python3 preprocessing/preprocess_punctuation.py 'path-to-csv-data' 0/1
(0: training data preprocessing, 1: test data preprocessing)
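A hedged sketch of what punctuation preprocessing on such a CSV typically looks like; the exact cleaning rules, column name ("text"), and output naming in preprocess_punctuation.py may differ.

import re
import sys

import pandas as pd

# Usage mirroring the command above: python3 sketch.py path-to-csv-data 0|1
# (0: training data, 1: test data); the "text" column is an assumption.
path, is_test = sys.argv[1], sys.argv[2] == "1"

def strip_punctuation(text):
    # Replace everything except word characters and whitespace; the real rules may differ.
    return re.sub(r"[^\w\s]", " ", text).lower().strip()

df = pd.read_csv(path)
df["text"] = df["text"].astype(str).map(strip_punctuation)
suffix = "_test_prep.csv" if is_test else "_train_prep.csv"
df.to_csv(path.replace(".csv", suffix), index=False)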
The baseline models can be found in the implementations/baselines folder.
To download the essential data for this project:
mkdir data
cd data
wget http://www.da.inf.ethz.ch/teaching/2018/CIL/material/exercise/twitter-datasets.zip (to download the dataset)
wget the pretrained Twitter embeddings from https://nlp.stanford.edu/projects/glove/
mkdir interm_data
mkdir final_data
To build a co-occurrence matrix, run the following commands:
Note that the cooc.py script takes a few minutes to run, and displays the number of tweets processed.
./infrastructure/build_vocab.sh
./infrastructure/cut_vocab.sh
python3 ./infrastructure/pickle_vocab.py
python3 ./infrastructure/cooc.py
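For orientation, here is a condensed sketch of how a co-occurrence matrix can be built from the pickled vocabulary; the input/output file names (vocab.pkl, cooc.pkl, train_pos.txt, train_neg.txt) are assumptions, and cooc.py may differ in detail.

import pickle
from scipy.sparse import coo_matrix

with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)  # word -> index

rows, cols, data = [], [], []
for fn in ["data/train_pos.txt", "data/train_neg.txt"]:  # assumed input files
    with open(fn, encoding="utf-8") as f:
        for i, line in enumerate(f):
            tokens = [vocab.get(t, -1) for t in line.strip().split()]
            tokens = [t for t in tokens if t >= 0]
            # Every pair of words in the same tweet counts as one co-occurrence.
            for t1 in tokens:
                for t2 in tokens:
                    rows.append(t1)
                    cols.append(t2)
                    data.append(1)
            if (i + 1) % 10000 == 0:
                print(i + 1, "tweets processed")

cooc = coo_matrix((data, (rows, cols)))
cooc.sum_duplicates()  # merge repeated (row, col) entries into counts
with open("cooc.pkl", "wb") as f:
    pickle.dump(cooc, f)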
(Optional) For preprocessing:
python3 ./preprocessing/preprocess_baselines.py 'path-to-txt-data' 0/1 (for each data file separately, 0: training data preprocessing, 1: test data preprocessing)
(Optional) For removing duplicates:
python3 ./infrastructure/deduplication.py
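Removing duplicate tweets can be done in a few lines; this sketch assumes a CSV with a "text" column and illustrative file paths, which may not match what deduplication.py actually operates on.

import pandas as pd

# Assumed paths and column name; deduplication.py may work on different files.
df = pd.read_csv("interm_data/train.csv")
df = df.drop_duplicates(subset="text", keep="first")
df.to_csv("interm_data/train_dedup.csv", index=False)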
For manually computing GloVe embeddings (adjust the number of dimensions in the script as needed):
python3 ./infrastructure/glove_compute.py
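As a rough illustration of this step, the sketch below trains GloVe-style embeddings by SGD on the log co-occurrence counts. The hyperparameters, dimensionality, and file names are illustrative assumptions, not the exact settings of glove_compute.py.

import pickle
import numpy as np

with open("cooc.pkl", "rb") as f:
    cooc = pickle.load(f)  # sparse co-occurrence matrix from the previous step

embedding_dim = 20        # the "number of dimensions" mentioned above
n_max, alpha = 100, 0.75  # GloVe weighting parameters
eta, epochs = 0.001, 10

xs = np.random.normal(size=(cooc.shape[0], embedding_dim))
ys = np.random.normal(size=(cooc.shape[1], embedding_dim))

for epoch in range(epochs):
    for ix, jy, n in zip(cooc.row, cooc.col, cooc.data):
        w = min(1.0, (n / n_max) ** alpha)        # GloVe weighting function
        err = np.log(n) - np.dot(xs[ix], ys[jy])  # fit the log co-occurrence count
        grad_x = w * err * ys[jy]
        grad_y = w * err * xs[ix]
        xs[ix] += eta * grad_x
        ys[jy] += eta * grad_y

np.save("embeddings", xs)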
(Optional) For pretrained GloVe embeddings (requires the manually computed embeddings from the previous step):
python3 ./infrastructure/glove_pretrained.py
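Loading the downloaded Stanford Twitter GloVe vectors and restricting them to the project vocabulary might look roughly like this; the vocabulary file, GloVe file name, and dimensionality are assumptions.

import pickle
import numpy as np

with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)  # word -> index

dim = 200  # must match the downloaded glove.twitter.27B file
embeddings = np.random.normal(scale=0.1, size=(len(vocab), dim))

with open("glove.twitter.27B.200d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word, vec = parts[0], parts[1:]
        if word in vocab and len(vec) == dim:
            embeddings[vocab[word]] = np.asarray(vec, dtype=np.float32)

np.save("embeddings_pretrained", embeddings)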
For computing tweet embeddings:
python3 ./infrastructure/infrastructure.py (for manual embeddings)
python3 ./infrastructure/infrastructure_pretrained.py (for pretrained embeddings)
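One common way to turn word embeddings into tweet-level features is to average the vectors of the words in each tweet; the sketch below assumes the file names from the earlier steps and may not match the aggregation used by the infrastructure scripts.

import pickle
import numpy as np

with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)
embeddings = np.load("embeddings.npy")  # word embeddings from the previous step

def tweet_embedding(tweet):
    # Average the embeddings of all in-vocabulary words; zero vector if none match.
    idx = [vocab[t] for t in tweet.strip().split() if t in vocab]
    if not idx:
        return np.zeros(embeddings.shape[1])
    return embeddings[idx].mean(axis=0)

features = np.array([tweet_embedding(t)
                     for t in open("data/train_pos.txt", encoding="utf-8")])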
For the classification task:
python3 ./implementations/baselines/svm.py
python3 ./implementations/baselines/xgboost_impl.py
python3 ./implementations/baselines/logistic.py
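To show how the baselines consume the tweet embeddings, here is a minimal example with scikit-learn's logistic regression; the file paths are assumptions, and svm.py and xgboost_impl.py follow the same pattern with different classifiers.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X: tweet embeddings, y: labels (1 positive, 0 negative); paths are illustrative.
X = np.load("final_data/tweet_embeddings.npy")
y = np.load("final_data/labels.npy")

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=42)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))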
Our LSTM-based model can be found in the implementations/lstm folder.
For generating word indexes and word embeddings:
python3 implementations/lstm/process_data.py
For training/fine-tuning/testing the model:
python3 implementations/lstm/sentiment.py
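A minimal sketch of an LSTM sentiment classifier of this kind in Keras; the vocabulary size, embedding dimension, layer sizes, and random embedding matrix are placeholders, since process_data.py produces the real word indexes and embedding matrix.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.initializers import Constant

# Illustrative shapes and a random embedding matrix standing in for the real one.
vocab_size, embedding_dim = 20000, 200
embedding_matrix = np.random.normal(size=(vocab_size, embedding_dim))

model = Sequential([
    Embedding(vocab_size, embedding_dim,
              embeddings_initializer=Constant(embedding_matrix)),
    LSTM(128),
    Dense(1, activation="sigmoid"),  # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()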
The implementations related to BERT are contained in the implementations/bert folder. It contains code scripts for:
- fine-tuning pretrained BERT models in bert.py
- fine-tuning pretrained BERT models using the huggingface library in bert_huggging_face.py
- further pretraining the BERT model in retrain/retrain.sh and fine-tuning the resulting model with retrain/fine_tune.sh
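As a hedged sketch of huggingface-based fine-tuning (not the exact contents of bert_huggging_face.py), a minimal training loop on a toy batch could look like this; the model name, learning rate, and example tweets are assumptions.

import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

tweets = ["<user> i love this !", "worst day ever ..."]  # toy examples
labels = torch.tensor([1, 0])
batch = tokenizer(tweets, padding=True, truncation=True, max_length=64, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few optimization steps on the toy batch
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()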
The implementations related to RoBERTa are contained in the implementations/robert folder. It contains code scripts for:
- fine-tuning RoBERTa base in roberta_base.py
- fine-tuning RoBERTa large in roberta_large.py
- fine-tuning RoBERTa large with an additional BiLSTM layer in roberta_large_lstm.py
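To illustrate the last variant, here is a sketch of a RoBERTa encoder with a BiLSTM head in PyTorch; the layer sizes, pooling choice, and class name are assumptions and roberta_large_lstm.py may be structured differently.

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RobertaBiLSTM(nn.Module):
    # Hypothetical module: RoBERTa hidden states fed through a BiLSTM, then a linear classifier.
    def __init__(self, name="roberta-large", lstm_dim=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.bilstm = nn.LSTM(hidden, lstm_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_dim, 2)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(states)
        return self.classifier(lstm_out[:, 0])  # classify from the first token's BiLSTM output

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = RobertaBiLSTM()
batch = tokenizer(["<user> great movie !"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])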
We used two ensembling methods to optimize our final model:
- Linear regression
- XGBoost
The respective implementations can be found in the implementations/ensembling folder.
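A minimal sketch of the linear-regression variant of such an ensemble, stacking per-model prediction scores into a feature matrix; the prediction file names are assumptions, and the XGBoost variant would swap LinearRegression for an XGBoost model.

import numpy as np
from sklearn.linear_model import LinearRegression

# Stack per-model scores column-wise; these paths are illustrative placeholders.
preds = np.column_stack([
    np.load("final_data/preds_bert.npy"),
    np.load("final_data/preds_roberta_large.npy"),
    np.load("final_data/preds_lstm.npy"),
])
labels = np.load("final_data/validation_labels.npy")

ens = LinearRegression().fit(preds, labels)
ensemble_scores = ens.predict(preds)
final_labels = (ensemble_scores >= 0.5).astype(int)  # threshold back to 0/1 labels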