This repository is for our ACL 2023 Findings paper:
OpenPI-C: A Better Benchmark and Stronger Baseline for Open-Vocabulary State Tracking
The code is based on the original OpenPI dataset: https://allenai.org/data/openpi
OpenPI dataset files are available in JSON format under `data`. There are four files: `{train,dev,test}.jsonl` and `test.jsonl.clustered`.

The three files `{train,dev,test}.jsonl` share the same format and represent the training, development, and test sets respectively. Each line is a JSON object representing one data point, i.e., one step in a process. An example is as follows:
```json
{
  "id": [
    "www.wikihow.com/Stop-a-Mosquito-Bite-from-Itching-Using-a-Spoon",
    1
  ],
  "query": "It\u2019s always a good idea to disinfect an area that has been bitten or stung by an insect.",
  "answers": [
    [
      "skin",
      "cleanness",
      "clean",
      "covered in disinfectant"
    ],
    [
      "disinfectant",
      "location",
      "in bottle",
      "on bite"
    ]
  ]
}
```
The JSON object contains three fields:

- `id`: a two-tuple. The first item is the ID of the process, and the second item is the index of this step within the process.
- `query`: the description of the current step.
- `answers`: the state tracking results. This field is a list where each item is a 4-tuple representing (entity, attribute, pre-state, post-state).
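As an illustration, a single data point can be loaded and unpacked like this (a minimal sketch using the example above; in practice each line of `{train,dev,test}.jsonl` would be read from the file):

```python
import json

# One data point, i.e., one line of {train,dev,test}.jsonl (taken from the example above).
line = '''{"id": ["www.wikihow.com/Stop-a-Mosquito-Bite-from-Itching-Using-a-Spoon", 1],
"query": "It\\u2019s always a good idea to disinfect an area that has been bitten or stung by an insect.",
"answers": [["skin", "cleanness", "clean", "covered in disinfectant"],
["disinfectant", "location", "in bottle", "on bite"]]}'''

point = json.loads(line)
process_id, step_idx = point["id"]  # process ID and step index within the process

# Each answer is a 4-tuple: (entity, attribute, pre-state, post-state).
for entity, attribute, pre, post in point["answers"]:
    print(f"{attribute} of {entity}: {pre} -> {post}")
```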
To facilitate cluster-based F1 evaluation, we pre-cluster `test.jsonl` into `test.jsonl.clustered`. It is identical to `test.jsonl` except for an additional field named `answer_clusters`: a list of integers with the same length as `answers`, where each item indicates which cluster the corresponding item in `answers` belongs to.
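For example, the cluster labels can be used to group the answer tuples (a minimal sketch; the `answers` and `answer_clusters` values here are illustrative, not taken from the dataset):

```python
from collections import defaultdict

# Illustrative values: two 4-tuples assigned to the same cluster,
# e.g., two surface forms of the same underlying state change.
answers = [
    ["skin", "cleanness", "clean", "covered in disinfectant"],
    ["skin", "state", "dirty", "disinfected"],
]
answer_clusters = [0, 0]

# Group each answer tuple under its cluster ID.
clusters = defaultdict(list)
for tup, cluster_id in zip(answers, answer_clusters):
    clusters[cluster_id].append(tup)

print(dict(clusters))
```

Grouping this way lets an evaluation script count matches at the cluster level rather than once per tuple.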
To set up the environment, it's recommended to create a new conda environment as follows:

```shell
conda create -y -n openpi-c python=3.8
conda activate openpi-c
```

Then install `pytorch==1.7.0` as in https://pytorch.org/get-started/previous-versions/. For example, with CUDA 10.1:

```shell
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.1 -c pytorch
```

Finally, install the requirements via pip:

```shell
pip install -r requirements.txt
```
Unfortunately, due to a version conflict, this environment does not support installing `sentence-transformers` and thus does not support cluster-based F1 evaluation. We need a separate environment for cluster-based evaluation; its requirements are in `requirements.cluster-f1.txt`. We recommend creating the environment as follows:

```shell
conda create -y -n cluster-f1 python=3.8
conda activate cluster-f1
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.1 -c pytorch  # adjust for your CUDA version
pip install -r requirements.cluster-f1.txt
```

Before running the cluster-based F1 scripts (as in Cluster-based F1), you should first activate this `cluster-f1` environment.
To quickly reproduce our experiments, you can run the following scripts:

```shell
bash scripts/train_baseline.sh       # train the BART baseline
bash scripts/train_concat-states.sh  # train the BART+concat states model
bash scripts/train_econd.sh          # train the ECond model
bash scripts/train_emem.sh           # train the EMem model
```
Similarly, to evaluate the BART baseline, BART+concat states, and EMem models, simply run:

```shell
bash scripts/infer_baseline.sh       # evaluate the BART baseline
bash scripts/infer_concat-states.sh  # evaluate the BART+concat states model
bash scripts/infer_emem.sh           # evaluate the EMem model
```
However, to evaluate BART+ECond, you first need to run evaluation for the BART baseline, which writes the generation outputs to `exps/baseline_facebook-bart-large/gen-out.formatted.jsonl`. Then run evaluation for BART+ECond as follows:

```shell
bash scripts/infer_econd.sh exps/baseline_facebook-bart-large/gen-out.formatted.jsonl
```

Similarly, to evaluate BART+EMem+ECond, first run evaluation for BART+EMem to get the generation outputs at `exps/emem_facebook-bart-large/gen-out.formatted.jsonl`, then run:

```shell
bash scripts/infer_econd.sh exps/emem_facebook-bart-large/gen-out.formatted.jsonl
```
Unfortunately, due to the version conflict mentioned in the Environment Setup section, the scripts above cannot produce cluster-based F1 numbers. To compute cluster-based F1 results, first activate the corresponding environment:

```shell
conda activate cluster-f1
```

Then run `scripts/cluster-based-f1.sh` on the generation outputs. For example, for the BART baseline:

```shell
bash scripts/cluster-based-f1.sh exps/baseline_facebook-bart-large/gen-out.formatted.jsonl
```