Portuguese translation of the GLUE benchmark, SNLI, and SciTail,
produced with the OPUS-MT model and Google Cloud Translation.
Datasets | Translation Tool |
---|---|
CoLA, MRPC, RTE, SST-2, STS-B, and WNLI | Google Cloud Translation |
SNLI, MNLI, QNLI, QQP, and SciTail | OPUS-MT |
from datasets import load_dataset
data = load_dataset("dlb/plue", "cola")
# Available configurations:
# ['cola', 'sst2', 'mrpc', 'qqp_v2', 'stsb', 'snli', 'mnli', 'mnli_mismatched', 'mnli_matched', 'qnli', 'qnli_v2', 'rte', 'wnli', 'scitail']
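As a minimal sketch, the configuration name can be checked against the list above before anything is downloaded. `PLUE_CONFIGS` and `load_plue` are hypothetical names introduced here for illustration and are not part of the dataset or of the `datasets` library. The sketch assumes the list in this README is complete.

```python
# Configuration names as listed in this README (assumed complete).
PLUE_CONFIGS = [
    "cola", "sst2", "mrpc", "qqp_v2", "stsb", "snli", "mnli",
    "mnli_mismatched", "mnli_matched", "qnli", "qnli_v2", "rte",
    "wnli", "scitail",
]


def load_plue(config):
    """Hypothetical helper: validate the name, then load it from the Hub."""
    if config not in PLUE_CONFIGS:
        raise ValueError(
            f"Unknown PLUE configuration {config!r}; expected one of {PLUE_CONFIGS}"
        )
    # Imported lazily so the name check works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("dlb/plue", config)
```

Calling `load_plue("rte")` is then equivalent to `load_dataset("dlb/plue", "rte")`, while a misspelled name fails early with a message listing the valid choices.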
Larger files are not hosted in the GitHub repository. They are available through:
- DVC integration

      $ pip install dvc
      $ dvc pull datasets/SNLI/train_raw.tsv
      $ dvc pull datasets/SNLI/train.tsv
      $ dvc pull datasets/MNLI/train.tsv
      $ dvc pull pairs/QQP.json
- ZIP links
├── code            # translation code and dependency parsing
├── datasets
│   ├── CoLA
│   ├── MNLI
│   ├── MRPC
│   ├── QNLI
│   ├── QNLI_v2
│   ├── QQP_v2
│   ├── RTE
│   ├── SciTail
│   │   └── tsv_format
│   ├── SNLI
│   ├── SST-2
│   ├── STS-B
│   └── WNLI
└── pairs           # translation pairs as JSON dictionary
- GLUE has been released in two versions. We found that they differ only in the QNLI and QQP datasets, so we provide QNLI in both versions and QQP only in the newer one.
- The LX parser, the Binarizer code, and the NLTK word tokenizer were used to create dependency parses for the SNLI and MNLI datasets.
- The SNLI train split is a ragged matrix, so we provide two versions of the data: train_raw.tsv keeps the irregular lines and train.tsv excludes them.
- Twelve sentences were translated manually due to translation errors.
- Our translation code is outdated. We recommend using other translation code instead.
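The split between train_raw.tsv and train.tsv described in the SNLI note can be illustrated with a short sketch. This is not the authors' procedure. It assumes that an "irregular" line is one whose number of tab-separated fields differs from the most common count. The function name `split_ragged_rows` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def split_ragged_rows(lines, delimiter="\t"):
    """Split TSV rows into (regular, irregular) lists by field count.

    Assumption: the most common field count is the 'regular' width.
    """
    # QUOTE_NONE keeps quotation marks inside sentences from being parsed.
    rows = list(csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE))
    if not rows:
        return [], []
    expected = Counter(len(row) for row in rows).most_common(1)[0][0]
    regular = [row for row in rows if len(row) == expected]
    irregular = [row for row in rows if len(row) != expected]
    return regular, irregular


# Made-up sample: a header, two well-formed rows, and one row missing a field.
sample = io.StringIO(
    "premise\thypothesis\tlabel\n"
    "Um homem corre.\tUma pessoa se move.\tentailment\n"
    "Linha quebrada\tneutral\n"
    "Uma mulher lê.\tNinguém lê.\tcontradiction\n"
)
regular, irregular = split_ragged_rows(sample)
print(len(regular), len(irregular))
```

On a real file, the same function can be called with an open file handle, for example `open("datasets/SNLI/train_raw.tsv", encoding="utf-8", newline="")`.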
@misc{Gomes2020,
author = {GOMES, J. R. S.},
title = {PLUE: Portuguese Language Understanding Evaluation},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ju-resplande/PLUE}},
commit = {e7d01cb17173fe54deddd421dd735920964eb26f}
}
- Deep Learning Brasil/CEIA
- Cyberlabs