Total number of utterances: 6070
Total number of paraphrases: 39666
Total number of admissible advisor queries: 2920
A JSON file containing:
'Sen_Legend' : a dictionary that maps sentence ids to sentence strings
'Ner_Legend' : a dictionary that maps sentence ids to ner output
'Set_Examples' : a list of examples (see below)
To reduce file size, sentence strings are encoded by their ids and can be
retrieved by indexing into Sen_Legend.
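The lookup described above can be sketched as follows. This is a minimal illustration, not the project's own loader: the file contents in `raw` are made up, and the helper name `sentence_for` is hypothetical. The only firm fact relied on is that JSON object keys are always strings, so the id is converted with `str()` before indexing.

```python
import json

# Hypothetical miniature file contents; real ids, strings and NER output differ.
raw = """
{
  "Sen_Legend": {"0": "hello there", "1": "hi"},
  "Ner_Legend": {},
  "Set_Examples": []
}
"""

data = json.loads(raw)


def sentence_for(sen_legend, sentence_id):
    """Return the sentence string for an id.

    JSON object keys are always strings, so the id is converted first.
    """
    return sen_legend[str(sentence_id)]


print(sentence_for(data["Sen_Legend"], 0))
```

For a real file, `json.loads(raw)` would be replaced by `json.load()` on the opened dataset file.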
Each example in Set_Examples is structured as follows.
Set_Examples =
[
{
'x_prior_utterances': [{}, ...],
'y_correct_paras': [{}, ...],
'y_paraphrase_bag': [{}, ...],
}, ...
]
'x_prior_utterances' : a list of prior utterances in temporal / spoken order
'y_correct_paras' : a list of paraphrases (the correct subset of y_paraphrase_bag)
'y_paraphrase_bag' : a list of paraphrases in shuffled order
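A quick sanity check of one example might look like the sketch below. It only tests what is stated above: the three keys exist, and every entry of y_correct_paras also occurs in y_paraphrase_bag. The function name `check_example` and the inner field `"sid"` are assumptions for illustration, since the fields of the inner dictionaries are not documented here.

```python
REQUIRED_KEYS = ("x_prior_utterances", "y_correct_paras", "y_paraphrase_bag")


def check_example(example):
    """Return True if all three keys exist and the correct paraphrases
    form a subset of the paraphrase bag."""
    if not all(key in example for key in REQUIRED_KEYS):
        return False
    bag = example["y_paraphrase_bag"]
    # Entries are compared by equality, so no inner field names are needed.
    return all(para in bag for para in example["y_correct_paras"])


# Hypothetical example; the "sid" field is made up for illustration.
example = {
    "x_prior_utterances": [{"sid": 3}, {"sid": 7}],
    "y_correct_paras": [{"sid": 12}],
    "y_paraphrase_bag": [{"sid": 40}, {"sid": 12}, {"sid": 25}],
}

print(check_example(example))
```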
Dataset Type 0: No paraphrases appear in preceding dialog
Preceding utterances are all original (not replaced with their paraphrases)
Dataset Type 1: Paraphrases appear in preceding dialog
Preceding utterances are replaced with their paraphrases at random
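The difference between the two variants can be sketched as below. This is an illustration under stated assumptions, not the generator's actual logic: the function name, the `paraphrases` mapping and the `replace_prob` parameter are hypothetical, and the source does not say with what probability an utterance is replaced.

```python
import random


def build_prior_utterances(utterances, paraphrases, with_para, rng, replace_prob=0.5):
    """Return the preceding dialog for one example.

    with_para false: utterances are kept as they are.
    with_para true: each utterance is swapped for one of its paraphrases
    with probability replace_prob (an assumed value, not from the source).
    """
    if not with_para:
        return list(utterances)
    result = []
    for utt in utterances:
        options = paraphrases.get(utt, [])
        if options and rng.random() < replace_prob:
            result.append(rng.choice(options))
        else:
            result.append(utt)
    return result


# Hypothetical data for illustration only.
dialog = ["what should i take", "i like math"]
paras = {"what should i take": ["which course should i pick"]}

rng = random.Random(0)
print(build_prior_utterances(dialog, paras, with_para=False, rng=rng))
```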
--trainsize=INT (desired size of the generated train set)
default value: 1000
--testsize=INT (desired size of the generated test set)
default value: 100
--bagsize=INT (desired size of the paraphrase bag)
default value: 100
--withpara=BOOL (generate examples by replacing with paraphrases)
default value: 0
--setname=STRING (name for set being generated)
default value: Datetime
--numhold=INT (number of dialogs to hold for generating the test set)
default value: 50
--fillwboth=BOOL (fill test set paraphrase bags with test+train utterances)
default value: 0
--withpara=0 generates Dataset Type 0
--withpara=1 generates Dataset Type 1
--fillwboth=0 fills test paraphrase bags with test utterances only
--fillwboth=1 fills test paraphrase bags with train+test utterances
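For reference, the options and defaults listed above could be declared with argparse roughly as follows. This is a sketch of one possible parser, not the contents of generate.py; treating the BOOL options as 0/1 integers and the timestamp format used for the default set name are both assumptions.

```python
import argparse
from datetime import datetime


def build_parser():
    """Declare the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("--trainsize", type=int, default=1000)
    parser.add_argument("--testsize", type=int, default=100)
    parser.add_argument("--bagsize", type=int, default=100)
    # BOOL options are assumed to be passed as 0 or 1.
    parser.add_argument("--withpara", type=int, choices=[0, 1], default=0)
    parser.add_argument("--setname", type=str, default=None)
    parser.add_argument("--numhold", type=int, default=50)
    parser.add_argument("--fillwboth", type=int, choices=[0, 1], default=0)
    return parser


def parse_options(argv):
    """Parse argv and fill in a datetime-based set name when none is given."""
    args = build_parser().parse_args(argv)
    if args.setname is None:
        # Assumed timestamp format; the source only says "Datetime".
        args.setname = datetime.now().strftime("%Y%m%d_%H%M%S")
    return args


args = parse_options(["--trainsize=1234", "--bagsize=89"])
print(args.trainsize, args.testsize, args.bagsize, args.withpara)
```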
We omit examples where there are no prior utterances (advisor speaks first)
We omit examples where there are 0 correct paraphrases in the bag
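The two omission rules above amount to a simple filter. The sketch below assumes examples use the key names shown earlier; the function name `keep_example` and the inner `"sid"` field are hypothetical.

```python
def keep_example(example):
    """Keep an example only if it has at least one prior utterance
    and at least one correct paraphrase in its bag."""
    has_prior = len(example["x_prior_utterances"]) > 0
    has_correct = len(example["y_correct_paras"]) > 0
    return has_prior and has_correct


# Hypothetical candidates; the "sid" field is made up for illustration.
candidates = [
    {"x_prior_utterances": [], "y_correct_paras": [{"sid": 1}], "y_paraphrase_bag": [{"sid": 1}]},
    {"x_prior_utterances": [{"sid": 2}], "y_correct_paras": [], "y_paraphrase_bag": [{"sid": 5}]},
    {"x_prior_utterances": [{"sid": 2}], "y_correct_paras": [{"sid": 4}], "y_paraphrase_bag": [{"sid": 4}]},
]

kept = [ex for ex in candidates if keep_example(ex)]
print(len(kept))
```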
To generate a dataset with the following settings:
train set size == 1234
test set size == 567
para bag size == 89
with paraphrase == 1
Run the following command from the /scripts directory:
python3 generate.py --trainsize=1234 --testsize=567 --bagsize=89 --withpara=1