-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
feger
committed
May 12, 2023
0 parents
commit 83d8e56
Showing
9 changed files
with
615,391 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
|
||
# User-specific stuff | ||
.idea/ | ||
data/import | ||
data/backup_tweets.csv | ||
data/url_dict.json | ||
poetry.lock | ||
pyproject.toml | ||
runs/ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,129 @@ | ||
# :taco: TACO -- Twitter Arguments from COnversations | ||
|
||
This repository contains the annotation framework, dataset and code used for the resource paper *"TACO -- Twitter Arguments from COnversations"*. | ||
|
||
| Notice: To execute this project, it can be run on [Google Colab](https://colab.research.google.com). | | ||
|------------------------------------------------------------------------------------------------------| | ||
|
||
**Table of Contents:** | ||
|
||
- [Repository Layout](#repository-layout) | ||
- [Findings](#findings) | ||
- [Publication](#publication) | ||
- [Licensing](#licensing) | ||
- [Contact](#contact) | ||
- [Acknowledgements](#acknowledgements) | ||
|
||
## Repository Layout | ||
|
||
1. [data](./data) | ||
1. [annotation_framework.pdf](./data/annotation_framework.pdf): The annotation framework for TACO. | ||
2. [import](./data/import): Conversations and expert decisions were brought in from external sources (*). | ||
3. [unify.py](./data/import/unify.py): Used to build the following ground truth data as specified in section 2.2 of the paper. | ||
1. [backup_tweets.csv](./data/backup_tweets.csv): Containing the clear text of all tweets (*). | ||
2. [url_dict.json](./data/url_dict.json): The resolved tiny URLs in order to trace the original URLs (*). | ||
3. [conversations.csv](./data/conversations.csv): Having stored the structure of conversations. | ||
4. [majority_votes.csv](./data/majority_votes.csv): All the majority votes, which serve as the labeled ground truth. | ||
5. [worker_decisions.csv](./data/worker_decisions.csv): All individual expert decisions. | ||
2. [notebooks](./notebooks) | ||
1. [dataset_statistics.ipynb](./notebooks/dataset_statistics.ipynb): For the dataset statistics as specified in the sections 2.2 - 2.4 | ||
of the paper. | ||
2. [classifier_cv.ipynb](./notebooks/classifier_cv.ipynb): For training and evaluating the baseline model as in the section 3 of the paper. | ||
1. [bertweet_cv_predictions.csv](./outputs/bertweet_cv_predictions.csv): The cross-validation result of the trained baseline model. | ||
|
||
| (*): The sensitive user data contained in these files should not be made public. Please [contact](#contact) for additional information. | | ||
|-----------------------------------------------------------------------------------------------------------------------------------------| | ||
|
||
## Findings | ||
|
||
### Sample Distribution | ||
|
||
count percent sample-time | ||
abortion 486 26.8% 2021/08/15-10/16 | ||
brexit 535 29.5% 2020/01/01-03/01 | ||
got 192 10.6% 2019/04/01-05/01 | ||
lotrrop 209 11.5% 2022/02/01-03/01 | ||
squidgame 226 12.5% 2021/09/10-10/10 | ||
twittertakeover 166 9.2% 2022/04/01-05/01 | ||
|
||
### Dataset Distribution | ||
|
||
Argument No-Argument | ||
865 (48.88%) 869 (50.12%) | ||
Reason Statement Notification None | ||
581 (33.5%) 284 (16.4%) 500 (28.8%) 369 (21.3%) | ||
|
||
### Conversational Reply Patterns | ||
|
||
Reason Statement Notification None | ||
Reason 0.51 0.12 0.31 0.06 | ||
Statement 0.38 0.21 0.33 0.08 | ||
Notification 0.26 0.08 0.57 0.09 | ||
None 0.26 0.08 0.44 0.22 | ||
|
||
### Performance Multi-Class Classification Task (BERTweet) | ||
|
||
precision recall f1-score support | ||
Reason 0.7369 0.7522 0.7445 581 | ||
Statement 0.5437 0.5915 0.5666 284 | ||
Notification 0.7902 0.7760 0.7830 500 | ||
None 0.8387 0.7751 0.8056 369 | ||
|
||
accuracy 0.7376 1734 | ||
macro avg 0.7274 0.7237 0.7249 1734 | ||
weighted avg 0.7423 0.7376 0.7395 1734 | ||
|
||
### Performance Binary Classification Task (BERTweet) | ||
|
||
precision recall f1-score support | ||
No-Argument 0.8666 0.8297 0.8477 869 | ||
Argument 0.8359 0.8717 0.8534 865 | ||
|
||
accuracy 0.8506 1734 | ||
macro avg 0.8513 0.8507 0.8506 1734 | ||
weighted avg 0.8513 0.8506 0.8506 1734 | ||
|
||
### Textual Features | ||
|
||
Reason Statement Notification None | ||
Average Length 213 122 156 63 | ||
URLs 34.6% 8.1% 71.6% 7.6% | ||
external URLs 41.8% 17.4% 49.7% 17.9% | ||
Emojis 11.9% 14.1% 16.0% 35.8% | ||
Hashtags 45.8% 38.7% 60.0% 12.2% | ||
Users 65.9% 68.0% 56.4% 91.3% | ||
Discourse Marker 32.9% 19.0% 11.4% 8.7% | ||
|
||
### Error Analysis | ||
|
||
Reason Statement Notification None | ||
Reason 437 76 66 2 | ||
Statement 73 168 13 30 | ||
Notification 63 26 388 23 | ||
None 20 39 24 286 | ||
|
||
## Publication | ||
|
||
## Licensing | ||
|
||
<p> | ||
<a property="dct:title" rel="cc:attributionURL" href="https://github.com/TomatenMarc/TACO">TACO -- Twitter Arguments from Conversations</a> by | ||
<a rel="cc:attributionURL dct:creator" property="cc:attributionName" href="http://marc-feger.de">Marc Feger</a> is licensed under | ||
<a href="http://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">CC BY-NC-SA 4.0</a> | ||
<div style="display:block;"> | ||
<img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> | ||
<img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> | ||
<img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/nc.svg?ref=chooser-v1"> | ||
<img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/sa.svg?ref=chooser-v1"> | ||
</div> | ||
</p> | ||
|
||
## Contact | ||
|
||
Please contact [marc.feger@uni-duesseldorf.de](marc.feger@uni-duesseldorf.de) or [stefan.dietze@gesis.org](stefan.dietze@gesis.org). | ||
|
||
## Acknowledgements | ||
|
||
We thank Aylin Martin, Tillmann Junk, Andreas Burbach, Talha Caliskan, and Aaron Schneider for their contributions to the | ||
annotation process in this paper. |
Binary file not shown.
Oops, something went wrong.