This repository contains the code for the paper *From Sentence to Action: Splitting AMR Graphs for Recipe Instructions* (DMR 2023).
It contains the scripts to separate sentence-level AMR graphs into AMR graphs for individual action events and to generate action-event level recipe instructions for the obtained AMRs.
The code for training the generation model by fine-tuning a T5-based AMR-to-text model for this task can be found in the recipe-generation-model repository.
The Wiki contains more details about the implemented algorithms as well as the AMR graph structures and file formats.
The implemented steps of the overall pipeline are:
- Parsing each recipe sentence by sentence into AMR graphs with recipe-level node-to-token alignments, see AMR Parsing
- Separating the AMR graphs into subgraphs in order to get one AMR per action event in the corresponding action graph for the recipe, see AMR Splitting
- Extracting approximated gold instructions for the split action-level AMR graphs, as well as extracting action-level instructions based on dependency information only, see Extraction
- Generating a recipe text based on an action graph, the AMR graphs corresponding to each action node, and a graph traversal, see Generating Recipe Texts
Tested with Python 3.6 and newer versions.
Run `pip install -e .` in the main repository directory. This makes all modules and functions within the repository importable and already installs most of the dependencies, with the following two exceptions:
- The PyTorch library (e.g. version 1.10.1)
- The `transformers` library from Huggingface (e.g. version 4.11.3; version 3 will probably not work). Depending on your OS and environment setup, you can use one of these two commands for the installation:

```
conda install -c huggingface transformers
pip install transformers
```
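To verify the installation, a quick sanity check (using the example versions mentioned above) could be:

```python
# Quick check that the two manually installed dependencies are importable.
import torch
import transformers

print(torch.__version__)         # e.g. 1.10.1
print(transformers.__version__)  # e.g. 4.11.3
```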
Note: Running the AMR parser requires additional dependencies beyond the ones listed in this section (see the amr_parsing Readme).
See the Readme in the amr_parsing folder for more details on creating the AMR representations of a recipe corpus and the requirements. If a dataset of recipe AMRs with node-to-token alignments is already available, the amr_parsing subfolder can be excluded to avoid the need to install the dependencies for the parser.
For the details on how the AMR splitting algorithm works see the Wiki.
Create a folder `data` in the main project folder. Add the folder with the ARA 1.1 corpus to the `data` folder and call it `ara1.1`.

Create a folder `data_ara2` in the main project folder. Add the folder with the ARA2 corpus to the `data_ara2` folder and call it `ara2.0`. Additionally, add the folder with the parsed sentence-level AMRs (including node-token alignments matching the token IDs of the ARA corpus) and call it `recipe_amrs_sentences`.

Instead of naming the folders as explained above, you can simply adapt the `ARA_DIR` and `SENT_AMR_DIR` variables in `utils/paths.py` to match your folder structure.
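For orientation, the relevant entries in `utils/paths.py` might then look like this (a minimal sketch with illustrative values; the actual file may define additional paths):

```python
# utils/paths.py (sketch; values are illustrative, adjust them to your folder layout)
ARA_DIR = "data/ara1.1"                      # root folder of the ARA corpus
SENT_AMR_DIR = "data/recipe_amrs_sentences"  # sentence-level AMRs with alignments
```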
Folder structures should be:

```
---data
   |---ara1.1
      |---dish1
         |---recipes
            |---dish1_0.conllu
            |---dish1_1.conllu
            ...
         |---alignments.tsv
      |---dish2
      ...
   |---recipe_amrs_sentences
      |---dish1
         |---dish1_0_sentences_amr.txt
         |---dish1_1_sentences_amr.txt
         ...
      |---dish2
      ...
```
Then run the `amr_splitting.py` script. It will run the AMR splitting algorithm on all AMRs in the `recipe_amrs_sentences` folder. The separated version of the corpus will be stored in the (automatically created) folder `data/recipe_amrs_actions`, with one subfolder per dish directly containing the .txt files for each recipe. Additionally, two logging files will be created in the (automatically created) `logs` folder:
- `non_separable_amrs.txt`: lists the names of all AMRs that were not separable, as well as those that were separable using the fallback cases
- `splitting_log.txt`: additional information about the dataset, e.g. the number of AMRs before splitting, the number of AMRs after splitting, ...
Each log file gets the date and time at which it was created as a unique prefix to avoid overwriting.
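Assuming the default folder layout above (so that the script can pick up its input and output locations from `utils/paths.py`), a run from the main project directory should be as simple as:

```
python amr_splitting.py
```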
The separated AMRs that the splitting algorithm produces still include the original sentence corresponding to the original AMR as their `::snt` metadata. In order to extract instructions for the separated action-level AMRs, navigate to `training/prepare_data_sets` and run the following:
```
python generate_gold_action_instructions.py --sep_dir [sep_dir] --orig_dir [orig_dir] --ara_dir [ara_dir] --out_dir [out_dir] --text
```
- `sep_dir`: optional; path to the parent directory with the separated action-level AMRs; defaults to `ACTION_AMR_DIR` defined in `utils/paths.py`
- `orig_dir`: optional; path to the parent directory with the original sentence-level AMRs; defaults to `SENT_AMR_DIR` defined in `utils/paths.py`
- `ara_dir`: optional; path to the parent directory of the ARA corpus; defaults to `ARA_DIR` defined in `utils/paths.py`
- `out_dir`: path to the directory where the generated instructions should be saved; gets created if it doesn't exist yet; will have the same folder structure as the A-AMR dir
- `coref`: optional; whether to use coreference information in order to remove redundant actions; requires prepared coreference files in the `JOINED_COREF_DIR` directory; defaults to False
- `text`: optional; whether only the generated instructions should be saved in the output files, or the A-AMRs together with the extracted gold instructions as part of the metadata; defaults to False
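For example, a call with explicit paths could look like this (the paths are illustrative and relative to `training/prepare_data_sets`):

```
python generate_gold_action_instructions.py \
    --sep_dir ../../data/recipe_amrs_actions \
    --orig_dir ../../data/recipe_amrs_sentences \
    --ara_dir ../../data/ara1.1 \
    --out_dir ../../data/action_instructions \
    --text
```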
For more details about the extraction itself see the wiki page.
In order to generate one action-event level recipe based on a specific action graph, run
```
python generate_recipe.py --file [action_graph_file] --cont [context_len] --order [ordering_version] --config [configuration_file] --out [output_file]
```
- `action_graph_file`: path to the .conllu file with the action graph of the recipe
- `context_len`: number of previously generated sentences to include as input to the generation model (should not be larger than the `context_len` the model was trained with)
- `ordering_version`: optional; version of the traversing function to use; default is "pf-lf-id"
- `configuration_file`: path to the .json file with the configurations for the generation
- `output_file`: optional; path to the file where the generated texts will be saved; each tab-separated column will contain the recipe from one traversal, one sentence per line, and the first line is a header; if not provided, the generated texts will only be printed to the command line
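For instance (file and directory names are illustrative):

```
python generate_recipe.py --file data/ara1.1/dish1/recipes/dish1_0.conllu \
    --cont 1 --order pf-lf-id --config generation_config.json \
    --out dish1_0_generated.txt
```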
In order to generate all action-event level recipes of a dataset split, run

```
python generate_data_set_split.py --split [split_file] --type [split_type] --cont [context_length] --order [ordering_version] --config [configuration_file] --out [output_directory]
```
- `split_file`: file with the assignments of the recipes to the different dataset splits
- `split_type`: the name of the split to generate
- `context_len`: number of previously generated sentences to include as input to the generation model (should not be larger than the `context_len` the model was trained with)
- `ordering_version`: optional; version of the traversing function to use; default is "pf-lf-id"
- `configuration_file`: path to the .json file with the configurations for the generation
- `output_dir`: optional; path to the directory where the generated texts will be saved; if not provided, it defaults to a folder "output" in the main project directory; for each recipe specified in `split_file`, one output file will be created (named [recipe_name]_generated.txt) with one instruction per line; if `ordering_version` is "all", the output file will contain one column per traversal (tab-separated) with a header line at the top
`ordering_version`

Can be "top", "ids", "pf", "pf-lf" or "pf-lf-id" (see the wiki page for details of the different traversals), or can be set to "all" to generate one recipe text for each ordering.
`configuration_file`

For more information about the configuration files for recipe generation, see the recipe-generation-model readme. For generating from an action graph, the configuration file only needs to include the "generator_args" parameter dict. The specified "model_name_or_path" / "tokenizer_name_or_path" need to point to a directory of a trained T5-based AMR-to-text generation model, which needs to include all the files saved when running the huggingface methods to save a model and a tokenizer.
`split_file`

The path to a .tsv file with the assignment of the recipes to the different splits:

```
train   baked_ziti_7
train   garam_masala_8
val     waffles_9
test    cauliflower_mash_1
...
```
It is also possible to pass a file that was obtained by running `create_recipe2split_assignment` from the recipe-generation-model repository (see the Reproducible Split section). The script will take care of removing the leading path and the "_gold.txt" suffix.
`split_type`

Should be "train", "val" or "test" if the split file has the format shown above, but can be set to any value that occurs in the first column of the split file. All recipes whose value in the first column equals `split_type` are then chosen for generation.
The repository also contains scripts that use coreference information for making implicit arguments explicit, switching between explicit NP mentions and pronouns, creating a variation of the recipe corpus, and optionally including the information for the syntactic-dependency-based splitting. This code (located in the `coref_processing` folder) is in a preliminary state.
Information about coreference clusters, the corresponding AMR nodes and coreferences arising from the AMR splitting can be obtained by running the `coref_processing/create_joined_coref.py` script.
This requires another subfolder of the data folder (as described above) that contains one subfolder per dish with .jsonlines coref files.
The paths to the action-level AMR graphs and to the coreference files are specified in `utils/paths.py` (`ACTION_AMR_DIR` and `RAW_COREF_DIR`). The path to the output folder, which gets created and will contain the generated files, is also specified in `paths.py` (`JOINED_COREF_DIR`).
Details about the output format and information included can be found at the top of the script itself.
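Assuming all relevant paths are configured in `utils/paths.py` as described, running the script from the main project directory should be as simple as:

```
python coref_processing/create_joined_coref.py
```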