Datasets and source code for the paper ID10M: Idiom Identification in 10 Languages.
Please consider citing our work if you use data and/or code from this repository.
@inproceedings{tedeschi-etal-2022-id10m,
title = "{ID}10{M}: Idiom Identification in 10 Languages",
author = "Tedeschi, Simone and
Martelli, Federico and
Navigli, Roberto",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
month = jul,
year = "2022",
address = "Seattle, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-naacl.208",
doi = "10.18653/v1/2022.findings-naacl.208",
pages = "2715--2726",
abstract = "Idioms are phrases which present a figurative meaning that cannot be (completely) derived by looking at the meaning of their individual components.Identifying and understanding idioms in context is a crucial goal and a key challenge in a wide range of Natural Language Understanding tasks. Although efforts have been undertaken in this direction, the automatic identification and understanding of idioms is still a largely under-investigated area, especially when operating in a multilingual scenario. In this paper, we address such limitations and put forward several new contributions: we propose a novel multilingual Transformer-based system for the identification of idioms; we produce a high-quality automatically-created training dataset in 10 languages, along with a novel manually-curated evaluation benchmark; finally, we carry out a thorough performance analysis and release our evaluation suite at https://github.com/Babelscape/ID10M.",
}
In a nutshell, ID10M is a novel framework consisting of systems, training and validation data, and benchmarks for the identification of idioms in 10 languages.
Here you can find the automatically-created data that we used to train and evaluate our systems:
Language | Train | Dev | Sentences | Tokens | Idioms | B | I | O | Literal |
---|---|---|---|---|---|---|---|---|---|
Chinese | train_chinese.tsv | dev_chinese.tsv | 9543 | 244422 | 1301 | 5272 | 3823 | 235327 | 3918 |
Dutch | train_dutch.tsv | dev_dutch.tsv | 20935 | 548872 | 189 | 4530 | 10543 | 533799 | 16366 |
English | train_english.tsv | dev_english.tsv | 37919 | 1199492 | 4568 | 10102 | 19884 | 1169506 | 27408 |
French | train_french.tsv | dev_french.tsv | 35588 | 939161 | 188 | 12112 | 25248 | 901801 | 23238 |
German | train_german.tsv | dev_german.tsv | 26963 | 722109 | 819 | 8311 | 11500 | 702298 | 18488 |
Italian | train_italian.tsv | dev_italian.tsv | 29523 | 813445 | 452 | 8768 | 12353 | 792324 | 20506 |
Japanese | train_japanese.tsv | dev_japanese.tsv | 6388 | 211437 | 165 | 2534 | 1662 | 207241 | 3852 |
Polish | train_polish.tsv | dev_polish.tsv | 36333 | 862265 | 648 | 12971 | 14364 | 834930 | 22467 |
Portuguese | train_portuguese.tsv | dev_portuguese.tsv | 30942 | 764017 | 559 | 5824 | 8871 | 749322 | 24816 |
Spanish | train_spanish.tsv | dev_spanish.tsv | 28647 | 648776 | 1229 | 9994 | 13927 | 624855 | 17851 |
We underline that the just reported training data are automatically produced, hence they may contain errors. For further details about the produced silver data, please refer to the Section 3.1 of the paper.
Here you can find the test sets used to evaluate our systems:
Language | Test | Sentences | Tokens | Idioms | B | I | O | Seen | Unseen | Literal |
---|---|---|---|---|---|---|---|---|---|---|
English | test_english.tsv | 200 | 3287 | 142 | 159 | 373 | 2755 | 62 | 80 | 41 |
German | test_german.tsv | 200 | 4529 | 111 | 181 | 377 | 3971 | 71 | 40 | 19 |
Italian | test_italian.tsv | 200 | 5043 | 139 | 155 | 271 | 4617 | 87 | 52 | 48 |
Spanish | test_spanish.tsv | 200 | 2240 | 78 | 133 | 348 | 1759 | 19 | 59 | 66 |
For further details about the produced test data refer to the Section 3.2 of the paper.
The pretrained models are available here:
For further details about the neural architecture refer to the Section 3.3 of the paper.
To run the code, you just need to perform the following steps:
-
Install the requirements:
pip install -r requirements.txt
The code requires python >= 3.8, hence we suggest you to create a conda environment with python 3.8.
-
To train or test the system, you just need to run the main.py file
python src/main.py
Once the program is started it asks you to specify if you want to train or test the system, the desired language, etc.
If you train the system, model checkpoints will be saved in the src/checkpoints folder. Otherwise, if you evaluate your system, the script will load the model checkpoints stored in the src/checkpoints folder.
ID10M is licensed under the CC BY-SA-NC 4.0 license. The text of the license can be found here.
We underline that the source from which the raw sentences have been extracted is Wiktionary (wiktionary.org) and the BIO annotations identifying idiomatic expressions have been produced by Babelscape.
We gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon 2020 research and innovation programme (http://mousse-project.org/) and the support of the ELEXIS project No. 731015 under the European Union’s Horizon 2020 ([http://mousse-project.org/](http://mousse-project.org/)).