SARN: Inference

Quantifiers and monotonicity (and opposite adjectives) in reasoning tasks

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Usage

source .venv/bin/activate
# Finetuning
python -m sarn.train --output-dir models/bart-mq --log-dir logs/bart-mq facebook/bart-large-mnli data/training.csv
# Inference of two sequences (forwards)
python -m sarn.classify --model models/bart-mq "All dogs jumped over the fence." "Some small dogs jumped over the fence."
# ROC curve (SVG and PDF diagram)
python -m sarn.roc microsoft/deberta-large-mnli data/evaluation.csv
# Model accuracy on dataset
python -m sarn.accuracy models/deberta-mq data/evaluation.csv
# Dataset statistics
python -m sarn.stats data/training.csv
# Language Interpretability Tool
python -m sarn.lit \
  --models "facebook/bart-large-mnli" \
           "microsoft/deberta-large-mnli" \
           "./models/bart-mq" \
           "./models/deberta-mq" \
           "./models/bart-adj" \
           "./models/deberta-adj" \
  --datasets "./data/evaluation.csv" "./data/evaluation-adj.csv" \
  --cache_dir=cache_dir

As model, any valid Huggingface model (local or remote) can be specified that has been finetuned for sequence classification, e.g., facebook/bart-large-mnli, microsoft/deberta-large-mnli or a local path like models/bart-mq.

Tips

Download Models from CoLi Servers

# export COLI_USER=<your name>
scp -r ${COLI_USER:?}@last.cl.uni-heidelberg.de:/mnt/semproj/sem_proj20/proj1/models .

Check if labels do not have typos in datasets

for i in ./data/*.csv; do
  python3 -m sarn.validate_datasets "$i"
done

Datasets

See data/README.md for more information.

data/training.csv: Training dataset for quantifiers and monotonicity in reasoning tasks
data/evaluation.csv: Evaluation dataset for quantifiers and monotonicity in reasoning tasks
data/training.csv: Training dataset for quantifiers and monotonicity in reasoning tasks with opposite adjectives
data/evaluation.csv: Evaluation dataset for quantifiers and monotonicity in reasoning tasks with opposite adjectives

Creating the datasets

source .venv/bin/activate

# data/training.csv
wget https://github.com/verypluming/MED/raw/master/MED.tsv
python -m sarn.convert.med
wget https://github.com/verypluming/HELP/raw/master/output_en/pmb_train_v1.0.tsv
python -m sarn.convert.help
cat data/med.csv data/help.csv > data/training.csv

# data/evaluation.csv
wget https://nlp.stanford.edu/~wcmac/downloads/fracas.xml
python -m sarn.convert.fracas
wget -O diagnostic-full.tsv https://www.dropbox.com/s/ju7d95ifb072q9f/diagnostic-full.tsv?dl=1
python -m sarn.convert.superglue
cat data/fracas.csv data/superglue.csv > data/evaluation.csv

# data/training-adj.csv
wget https://github.com/verypluming/MED/raw/master/MED.tsv
python -m sarn.convert.med_adjectives
# manual step here (you may modify sentences where it makes sense):
# - label data/med_adjectives_1.csv by hand (third column)
# - label data/med_adjectives_2.csv by hand (third column)
# - remove fourth column in both files
wget https://nlp.stanford.edu/~wcmac/downloads/fracas.xml
python -m sarn.convert.fracas_adjectives
cat data/med_adjectives_1.csv data/med_adjectives_2.csv data/fracas_adjectives.csv > data/training-adj.csv
python -m sarn.validate_datasets data/training-adj.csv

# data/evaluation-adj.csv
python -m sarn.convert.evaluation_adjectives
# manual step here (you may modify sentences where it makes sense):
# - label data/evaluation-adj.csv by hand (third column)
# - remove fourth column
python -m sarn.validate_datasets data/evaluation-adj.csv

Dataset statistics

Character length

Dataset	avg	median	min	max
`data/training.csv`
Premises	48.26	41	5	478
Hypotheses	48.93	42	5	478
`data/evaluation.csv`
Premises	79.84	58	26	206
Hypotheses	61.57	50	26	186
`data/training-adj.csv`
Premises	48.93	44	14	212
Hypotheses	50.78	46	18	210
`data/evaluation-adj.csv`
Premises	100.19	83	25	189
Hypotheses	86.62	69	35	189

Word length

Dataset	avg	median	min	max
`data/training.csv`
Premises	9.98	9	2	83
Hypotheses	10.10	9	2	83
`data/evaluation.csv`
Premises	13.03	10	5	34
Hypotheses	10.14	9	5	30
`data/training-adj.csv`
Premises	8.89	8	3	29
Hypotheses	9.18	9	3	29
`data/evaluation-adj.csv`
Premises	15.44	12	5	31
Hypotheses	13.47	11	5	31

Labels

Dataset	total	contradiction	neutral	entailment
`data/training.csv`	41'273	0 (0.00%)	20'699 (50.15%)	20'574 (49.85%)
`data/evaluation.csv`	118	15 (12.71%)	52 (44.07%)	51 (43.22%)
`data/training-adj.csv`	1'206	420 (34.83%)	749 (62.11%)	37 (3.07%)
`data/evaluation-adj.csv`	144	47 (32.64%)	84 (58.33%)	13 (9.03%)

Models

facebook/bart-large-mnli: Pretrained model of BART finetuned on MultiNLI
microsoft/deberta-large-mnli: Pretrained model of DeBERTa finetuned on MultiNLI
models/bart-mq: finetuned version of facebook/bart-large-mnli on data/training.csv
models/deberta-mq: finetuned version of microsoft/deberta-large-mnli on data/training.csv
models/bart-adj: finetuned version of models/bart-mq on data/training-adj.csv
models/deberta-adj: finetuned version of models/deberta-mq on data/training-adj.csv

Creating the models

source .venv/bin/activate

# facebook/bart-large-mnli and microsoft/deberta-large-mnli will automatically
# be downloaded from huggingface.co when used

# models/bart-mq
python -m sarn.train --output-dir "models/bart-mq" --log-dir "logs/bart-mq" "facebook/bart-large-mnli" "data/training.csv"

# models/deberta-mq
python -m sarn.train --output-dir "models/deberta-mq" --log-dir "logs/deberta-mq" "microsoft/deberta-large-mnli" "data/training.csv"

# models/bart-adj
python -m sarn.train --output-dir "models/bart-adj" --log-dir "logs/bart-adj" "facebook/bart-large-mnli" "data/training-adj.csv"

# models/deberta-adj
python -m sarn.train --output-dir "models/deberta-adj" --log-dir "logs/deberta-adj" "microsoft/deberta-large-mnli" "data/training-adj.csv"

Model statistics

Accuracy

Model	`data/evaluation.csv`	`data/evaluation-adj.csv`
`facebook/bart-large-mnli`	65.25%	40.97%
`microsoft/deberta-large-mnli`	71.19%	47.22%
`models/bart-mq`	57.63%	34.72%
`models/deberta-mq`	61.86%	34.72%
`models/bart-adj`	45.76%	58.33%
`models/deberta-adj`	42.37%	57.64%

ROC curves

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. [...] AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). [...] AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.

— Google Machine Learning Crash Course

BART	DeBERTa

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

SARN: Inference

Setup

Usage

Tips

Download Models from CoLi Servers

Check if labels do not have typos in datasets

Datasets

Creating the datasets

Dataset statistics

Character length

Word length

Labels

Models

Creating the models

Model statistics

Accuracy

ROC curves

Files

README.md

Latest commit

History

README.md

File metadata and controls

SARN: Inference

Setup

Usage

Tips

Download Models from CoLi Servers

Check if labels do not have typos in datasets

Datasets

Creating the datasets

Dataset statistics

Character length

Word length

Labels

Models

Creating the models

Model statistics

Accuracy

ROC curves