Sentence embeddings for unsupervised event detection in the Twitter stream: study on English and French corpora
This repo lets researchers reproduce our Twitter event detection results on 2 datasets: the Event2012 dataset (tweets in English) and the Event2018 dataset (tweets in French; a full description of this dataset is provided in our paper).
Because some tweets have probably been deleted since we collected the datasets, we cannot guarantee identical results, but we are confident that the comparative performance of the models will remain unchanged.
Results of our unsupervised event detection (clustering) algorithm for each embedding type, depending on the threshold (t) parameter.
Some of the embeddings presented on this graph are not available here:
- English corpus: w2v-twitter
- French corpus: w2v-news, w2v-twitter, elmo
All other embeddings are available, and the corresponding modules are listed in the requirements.txt file.
You will only need to install the bert-as-service module (see the BERT section) and run the BERT server to test the BERT embedding.
- Installation
- Download Event2012 dataset
- Download Event2018 dataset
- Use 'First Story Detection' on your own dataset
- Clustering
- Classification
- Available embeddings
- Cite
We recommend using Anaconda 3 to create a Python 3.8 environment (install Anaconda here):
conda create -n "twembeddings" python=3.8.2
source activate twembeddings
Then clone the repo and install the module:
cd $HOME
git clone https://github.com/bmaz/twembeddings.git
cd twembeddings
pip install .
In compliance with Twitter's terms of use, the authors of the dataset do not share the tweets' content, but only the tweet IDs. Accept the dataset agreement and download the dataset. Untar the folder; the labeled tweets are in the relevant_tweets.tsv file.
We provide a script to download the tweets' content from the Twitter API. In order to run the script, you need to create a Twitter developer account and a Twitter App. Then get the app's access tokens. You should now have 4 tokens (the following strings are random examples):
- app_key: mIsU1P0NNjUTf9DjuN6pdqyOF
- app_secret: KAd5dpgRlu0X3yizTfXTD3lZOAkF7x0QAEhAMHpVCufGW4y0t0
- oauth_token: 4087833385208874171-k6UR7OGNFdfBcqPye8ps8uBSSqOYXm
- oauth_token_secret: Z9nZBVFHbIsU5WQCGT7ZdcRpovQm0QEkV4n4dDofpYAEK
Run the script:
python get_tweets_objects.py \
--path /yourpath/relevant_tweets.tsv \
--dataset event2012 \
--app_key mIsU1P0NNjUTf9DjuN6pdqyOF \
--app_secret KAd5dpgRlu0X3yizTfXTD3lZOAkF7x0QAEhAMHpVCufGW4y0t0 \
--oauth_token 4087833385208874171-k6UR7OGNFdfBcqPye8ps8uBSSqOYXm \
--oauth_token_secret Z9nZBVFHbIsU5WQCGT7ZdcRpovQm0QEkV4n4dDofpYAEK
The script may take some time to run entirely, since it respects the API's rate limit. Because tweets are removed and Twitter accounts are closed over time, some tweets are no longer available. Our last download (November 2019) allowed us to retrieve 72484 tweets (72% of the original dataset).
In compliance with Twitter's terms of use, we do not share the tweets' content, but only the tweet IDs. The corpus is available here.
Please fill in the agreement form and indicate the name of the corpus (Event2018) in your application.
Untar the folder; the labeled tweets are in the relevant_tweets.tsv file.
You can then download the full tweet content by creating your Twitter access tokens and running the script:
python get_tweets_objects.py \
--path /yourpath/relevant_tweets.tsv \
--dataset event2018 \
--app_key mIsU1P0NNjUTf9DjuN6pdqyOF \
--app_secret KAd5dpgRlu0X3yizTfXTD3lZOAkF7x0QAEhAMHpVCufGW4y0t0 \
--oauth_token 4087833385208874171-k6UR7OGNFdfBcqPye8ps8uBSSqOYXm \
--oauth_token_secret Z9nZBVFHbIsU5WQCGT7ZdcRpovQm0QEkV4n4dDofpYAEK
The script may take some time to run entirely, since it respects the API's rate limit. Because tweets are removed and Twitter accounts are closed over time, some tweets are no longer available. Our last download (November 2019) allowed us to retrieve 77249 tweets (81% of the original dataset).
Save your Twitter data as a tsv file (a csv file with "\t" as separator) in the data folder with the following column names:
id | label | text | created_at |
---|---|---|---|
created_at is the date of the tweet. The format can be either 2018-07-16 05:00:56 or Mon Jul 16 05:00:56 +0000 2018 (Twitter format).
label is the ground truth that you may use to evaluate the algorithm. You can leave the column empty if you have no ground truth.
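As an illustration of the expected layout, here is a minimal sketch that writes such a file with the Python standard library and checks that each created_at value matches one of the two formats above. The helper names (`parse_created_at`, `write_dataset`) are hypothetical and not part of this repo, and the default csv quoting of texts containing tabs or newlines is an assumption about what the loader accepts.

```python
import csv
from datetime import datetime

COLUMNS = ["id", "label", "text", "created_at"]
# The two date formats described above (plain timestamp and Twitter format).
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%a %b %d %H:%M:%S %z %Y")


def parse_created_at(value):
    """Parse a created_at string written in either accepted format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised created_at format: {value!r}")


def write_dataset(rows, path):
    """Write an iterable of dicts to a tab-separated file with the expected header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        for row in rows:
            parse_created_at(row["created_at"])  # fail early on a bad date
            writer.writerow(row)
```

For example, `write_dataset([{"id": "1", "label": "", "text": "hello", "created_at": "2018-07-16 05:00:56"}], "data/yourfile.tsv")` produces a one-tweet file with an empty label.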
You can then run something like this (--annotation no indicates that you have no annotated ground truth):
python clustering.py --dataset data/yourfile.tsv --lang fr --model tfidf_dataset --threshold 0.7 --annotation no
A new file with the predicted labels in column "pred" will be saved in data/yourfile_results.tsv.
The evaluation of the chosen parameters (if you have ground truth labels to evaluate on) will be saved to results_clustering.csv.
Run clustering with one or several embedding names as the model parameter.
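To get a quick look at the output, the following sketch counts how many tweets fall in each predicted cluster. It only relies on the "pred" column mentioned above; the function name `cluster_sizes` is hypothetical, and it assumes the results file is a tab-separated file with a header row.

```python
import csv
from collections import Counter


def cluster_sizes(lines):
    """Count tweets per predicted cluster from the lines of a results tsv."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["pred"] for row in reader)


if __name__ == "__main__":
    with open("data/yourfile_results.tsv", newline="", encoding="utf-8") as handle:
        # Print the ten largest predicted events.
        for label, size in cluster_sizes(handle).most_common(10):
            print(label, size)
```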
python clustering.py --dataset data/event2012.tsv --lang en --model tfidf_dataset w2v_gnews_en sbert_nli_sts
or
python clustering.py --dataset data/event2018.tsv --lang fr --model tfidf_dataset tfidf_all_tweets use
You can test several threshold parameters for the First Story Detection Algorithm by modifying the options.yaml file.
Model | t (en) | F1 (en) | t (fr) | F1 (fr) |
---|---|---|---|---|
bert | 0.04 | 39.22 | 0.04 | 44.79 |
bert-tweets | - | - | 0.02 | 50.02 |
elmo | 0.08 | 22.48 | 0.2 | 46.08 |
sbert-nli-sts | 0.39 | 58.24 | - | - |
tfidf-all-tweets | 0.75 | 70.1 | 0.7 | 78.05 |
tfidf-dataset | 0.65 | 68.07 | 0.7 | 74.39 |
use | 0.22 | 55.71 | 0.46 | 74.57 |
w2v-news | 0.3 | 53.99 | 0.25 | 66.34 |
w2v-news tfidf-weights | 0.31 | 61.81 | 0.3 | 75.55 |
w2v-twitter | 0.16 | 43.2 | 0.15 | 57.53 |
w2v-twitter tfidf-weights | 0.2 | 53.45 | 0.25 | 71.73 |
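To make the role of the threshold t more concrete, here is a minimal sketch of the general First Story Detection idea: each incoming tweet is compared with the tweets seen so far, and it starts a new event when its nearest neighbour is farther than t. This is an illustration under stated assumptions, not the code in clustering.py: the function names are hypothetical, cosine distance is assumed, and the search over all past tweets is a naive quadratic loop that the actual implementation may bound differently.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def first_story_detection(vectors, threshold):
    """Assign a cluster label to each vector, processed in arrival order."""
    labels = []
    for i, vec in enumerate(vectors):
        best_dist, best_idx = math.inf, None
        for j in range(i):  # compare with every earlier tweet
            dist = cosine_distance(vec, vectors[j])
            if dist < best_dist:
                best_dist, best_idx = dist, j
        if best_idx is None or best_dist > threshold:
            labels.append(max(labels, default=-1) + 1)  # start a new event
        else:
            labels.append(labels[best_idx])  # join the nearest neighbour's event
    return labels
```

A low threshold splits the stream into many small events, while a high threshold merges most tweets into a few large ones, which is why the best t differs from one embedding to another.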
Run classification with one or several embedding names as the model parameter.
python classification.py --dataset data/event2012.tsv --lang en --model tfidf_dataset w2v_gnews_en sbert_nli_sts
or
python classification.py --dataset data/event2018.tsv --lang fr --model tfidf_dataset bert
Additional options for each model can be modified in options.yaml.
Model | F1±std (en) | F1±std (fr) |
---|---|---|
bert | 74.49±0.41 | 78.46±0.68 |
bert-tweets | - | 81.77±0.7 |
elmo | 59.81±0.41 | 73.59±0.64 |
sbert-nli-sts | 80.55±0.33 | - |
tfidf-all-tweets | 83.5±0.78 | 87.79±0.58 |
tfidf-all-tweets svd | 62.4±0.72 | 75.32±0.88 |
tfidf-dataset | 83.46±0.72 | 87.66±0.69 |
tfidf-dataset svd | 58.24±0.52 | 75.92±0.56 |
use | 80.26±0.38 | 87.45±0.6 |
w2v-news | 81.35±0.53 | 86.59±0.8 |
w2v-news tfidf-weights | 82.39±0.64 | 87.51±0.71 |
w2v-twitter | 76.68±0.53 | 87.01±0.56 |
w2v-twitter tfidf-weights | 81.2±0.48 | 87.73±0.56 |
Since the same word is rarely used several times in the same tweet, we used the idf expression rather than the tfidf.
With option tfidf_dataset, only the annotated tweets are used to count words.
With option tfidf_all_tweets, all tweets in the corpora (millions of tweets) are used to count words.
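The following sketch shows what an idf-only representation looks like: each term present in a tweet receives its idf weight once, however many times it is repeated. The exact idf formula used by this repo (smoothing, normalisation) is not stated here, so the plain log(N / df) below is an assumption, and the function names are hypothetical. The list passed to `idf_weights` plays the role of either the annotated tweets or the full corpus, depending on the option.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Return {term: log(N / df)}, where df is the number of docs containing the term."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def idf_vector(tokens, idf):
    """Sparse vector of one tweet: each known term gets its idf, ignoring repeats."""
    return {term: idf[term] for term in set(tokens) if term in idf}
```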
Google Word2Vec model pretrained on Google News, with mean-pooling of word representations as sentence embedding.
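Mean-pooling can be sketched as follows: the sentence embedding is the average of the vectors of the words found in the vocabulary. The optional `weights` argument illustrates one plausible reading of the "tfidf-weights" variants in the results tables, namely a weighted average; that reading, the handling of out-of-vocabulary words, and the function name `mean_pool` are assumptions rather than a description of the repo's code.

```python
def mean_pool(tokens, word_vectors, weights=None):
    """Average the vectors of known tokens, optionally weighting each token."""
    dim = len(next(iter(word_vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for token in tokens:
        if token not in word_vectors:
            continue  # skip out-of-vocabulary words
        w = 1.0 if weights is None else weights.get(token, 0.0)
        total = [t + w * x for t, x in zip(total, word_vectors[token])]
        weight_sum += w
    if weight_sum == 0:
        return total  # all-zero vector when nothing could be pooled
    return [t / weight_sum for t in total]
```

With a toy vocabulary `{"fire": [1.0, 0.0], "paris": [0.0, 1.0]}`, the tweet `["fire", "paris"]` is pooled to `[0.5, 0.5]`.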
ELMo: pretrained model on TensorFlow Hub, with mean-pooling of word representations as sentence embedding.
If you want to use BERT embeddings, you need to install bert-as-service:
pip install bert-serving-server
pip install bert-serving-client
Then follow the guidelines to download a BERT model (we used BERT-Large, Cased for English and BERT-Base, Multilingual Cased for French) and start the BERT service:
bert-serving-start -model_dir=/yourpath/cased_L-24_H-1024_A-16 -max_seq_len=500 -max_batch_size=64
or
bert-serving-start -model_dir=/yourpath/multi_cased_L-12_H-768_A-12 -max_seq_len=500 -max_batch_size=64
Our program will act as a client to this service.
We use the default parameters of bert-as-service: the pooling layer is the second-to-last layer, and mean-pooling is used for sentence embedding.
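To check that the server is reachable before launching a full run, a client call along these lines can help. It is a minimal sketch using the `BertClient` class from bert-serving-client; the wrapper name `encode_with_bert` is hypothetical, and it assumes the server was started as shown above and listens on the default ports.

```python
def encode_with_bert(texts, ip="localhost"):
    """Send texts to a running bert-as-service server and return one vector per text."""
    # Imported here so this module can be loaded without bert-serving-client installed.
    from bert_serving.client import BertClient

    client = BertClient(ip=ip)
    try:
        # encode() takes a list of strings and returns an array with one row per string.
        return client.encode(list(texts))
    finally:
        client.close()
```

Calling `encode_with_bert(["first tweet", "second tweet"])` should return two vectors whose size depends on the chosen BERT model.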
Universal Sentence Encoder: pretrained model on TensorFlow Hub. The multilingual model was used for French.
Sentence-BERT: pretrained model from UKPLab. We use the bert-large-nli-stsb-mean-tokens model.
Details of the implemented approaches can be found in our publication: Mazoyer, B., Hervé, N., Hudelot, C., & Cagé, J. (2020). “Représentations lexicales pour la détection non supervisée d’événements dans un flux de tweets : étude sur des corpus français et anglais”. In “Extraction et Gestion des Connaissances (EGC 2020)”.
If you don't speak French but have access to Springer publications, you can read: Mazoyer, B., Hervé, N., Hudelot, C., & Cagé, J. (2024). “Comparison of Short-Text Embeddings for Unsupervised Event Detection in a Stream of Tweets”. In: Jaziri, R., Martin, A., Cornuéjols, A., Cuvelier, E., Guillet, F. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 1110. Springer, Cham. https://doi.org/10.1007/978-3-031-40403-0_4
Last, the broad principles of the event detection method are detailed in: Mazoyer, B., Cagé, J., Hervé, N. & Hudelot, C. (2020). “A French Corpus for Event Detection on Twitter”. In “International Conference on Language Resources and Evaluation (LREC 2020)”, 6220–6227.