# recipe-generation

This repository contains the code for the paper From Sentence to Action: Splitting AMR Graphs for Recipe Instructions (DMR 2023).

It contains the scripts to separate sentence-level AMR graphs into AMR graphs for individual action events and to generate action-event level recipe instructions for the obtained AMRs.
The code for training the generation model by fine-tuning a T5-based AMR-to-text model for this task can be found in the recipe-generation-model repository.
The Wiki contains more details about the implemented algorithms as well as the AMR graph structures and file format.

The implemented steps of the overall pipeline are:

  1. Parsing each recipe sentence by sentence into AMR graphs with recipe-level node-to-token alignments, see AMR Parsing
  2. Separating the AMR graphs into subgraphs in order to obtain one AMR per action event in the corresponding action graph for the recipe, see AMR Splitting
  3. Extracting approximated gold instructions for the split action-level AMR graphs, as well as extracting action-level instructions based on dependency information only, see Extracting Gold Instructions
  4. Generating a recipe text based on an action graph, the AMR graphs corresponding to each action node and a graph traversal, see Generating Recipe Texts

## Requirements

Tested with Python 3.6 and newer versions.

Run `pip install -e .` in the main repository directory. This will enable successful import of all modules and functions within the repository. Additionally, it will already install most of the dependencies, with the following two exceptions:

  • The pytorch library (e.g. 1.10.1)
  • Transformers from Huggingface (e.g. 4.11.3; version 3 will probably not work). Depending on your OS and environment setup, you can use one of these two commands for the installation:
    • conda install -c huggingface transformers
    • pip install transformers
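
To verify that both libraries were installed correctly, a quick check like the following should work (the versions printed will depend on your environment):

```python
# Sanity check: both extra dependencies are importable.
import torch
import transformers

print("pytorch:", torch.__version__)
print("transformers:", transformers.__version__)
```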

Note: Running the AMR parser requires additional dependencies beyond the ones listed in this section (see the amr_parsing Readme).

## AMR Parsing

See the Readme in the amr_parsing folder for more details on creating the AMR representations of a recipe corpus and the requirements. If a dataset of recipe AMRs with node-to-token alignments is already available, the amr_parsing subfolder can be excluded to avoid the need to install the dependencies for the parser.

## AMR Splitting

For details on how the AMR splitting algorithm works, see the Wiki.

Create a folder `data` in the main project folder. Add the folder with the ARA 1.1 corpus to the `data` folder and call it `ara1.1`.

Create a folder `data_ara2` in the main project folder. Add the folder with the ARA 2.0 corpus to the `data_ara2` folder and call it `ara2.0`.

Additionally, add the folder with the parsed sentence-level AMRs (including node-token alignments matching the token IDs of the ARA corpus) to the `data` folder and call it `recipe_amrs_sentences`.

Instead of naming the folders as explained above, you can simply adapt the `ARA_DIR` and `SENT_AMR_DIR` variables in `utils/paths.py` to match your folder structure.
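
For illustration, the relevant part of `utils/paths.py` could then look roughly like this (a hypothetical excerpt; the actual file may define these constants differently and include more of them):

```python
# utils/paths.py (hypothetical excerpt)
ARA_DIR = "data/ara1.1"                      # ARA 1.1 corpus
SENT_AMR_DIR = "data/recipe_amrs_sentences"  # sentence-level AMRs with alignments
```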

The folder structure should be:

```
---data
  |---ara1.1
    |---dish1
       |---recipes
          |---dish1_0.conllu
          |---dish1_1.conllu
          ...
       |---alignments.tsv
    |---dish2
    ...
  |---recipe_amrs_sentences
     |---dish1
         |---dish1_0_sentences_amr.txt
         |---dish1_1_sentences_amr.txt
         ...
     |---dish2
     ...
```

Then run the `amr_splitting.py` script. It will run the AMR splitting algorithm on all AMRs in the `recipe_amrs_sentences` folder. The separated version of the corpus will be stored in the (automatically created) folder `data/recipe_amrs_actions` with one subfolder per dish, directly containing the .txt files for each recipe.
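
Assuming the default paths in `utils/paths.py` match your setup, running the script should be as simple as:

```
python amr_splitting.py
```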

Additionally, two logging files will be created in the (automatically created) `logs` folder:

  • non_separable_amrs.txt: lists the names of all AMRs that were not separable, as well as those that were separable using the fallback cases
  • splitting_log.txt: additional information about the dataset, e.g. the number of AMRs before splitting, the number of AMRs after splitting, ...

Each log file gets the date and time at which it was created as a unique prefix to avoid overwriting.

## Extracting Gold Instructions

The separated AMRs produced by the splitting algorithm still include the original sentence of the corresponding original AMR as their `::snt` metadata. In order to extract instructions for the separated action-level AMRs, navigate to `training/prepare_data_sets` and run the following:

```
python generate_gold_action_instructions.py --sep_dir [sep_dir] --orig_dir [orig_dir] --ara_dir [ara_dir] --out_dir [out_dir] --text
```

  • sep_dir: optional; path to the parent directory with the separated action-level AMRs; defaults to ACTION_AMR_DIR defined in utils/paths
  • orig_dir: optional; path to the parent directory with the original sentence-level AMRs; defaults to SENT_AMR_DIR defined in utils/paths
  • ara_dir: optional; path to the parent directory of the ARA corpus; defaults to ARA_DIR defined in utils/paths
  • out_dir: path to the directory where the generated instructions should be saved; gets created if it doesn't exist yet and will mirror the folder structure of the action-level AMR (A-AMR) directory
  • coref: optional; whether to use coreference information in order to remove redundant actions; requires prepared coreference files in the JOINED_COREF_DIR directory; defaults to False
  • text: optional; whether the output files should contain only the generated instructions, or the A-AMRs together with the extracted gold instructions as part of their metadata; defaults to False
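
For example, relying on the default paths from `utils/paths.py` and writing text-only output (the output path here is just a placeholder):

```
python generate_gold_action_instructions.py --out_dir ../../data/gold_instructions --text
```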

For more details about the extraction itself see the wiki page.

## Generating Recipe Texts

In order to generate one action-event level recipe based on a specific action graph, run:

```
python generate_recipe.py --file [action_graph_file] --cont [context_len] --order [ordering_version] --config [configuration_file] --out [output_file]
```

  • action_graph_file: path to the .conllu file with the action graph of the recipe
  • context_len: number of previously generated sentences to include as input to the generation model (should not be larger than the context length the model was trained with)
  • ordering_version: optional; version of the traversal function to use; defaults to "pf-lf-id"
  • configuration_file: path to the .json file with the configurations for the generation
  • output_file: optional; path to the file where the generated texts will be saved; each tab-separated column will contain the recipe from one traversal, one sentence per line, with a header in the first line; if not provided, the generated texts are only printed to the command line
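
An example call could look like this (all file paths are placeholders for your own data, configuration and output locations):

```
python generate_recipe.py --file data/ara1.1/dish1/recipes/dish1_0.conllu --cont 1 --order pf-lf-id --config generation_config.json --out dish1_0_generated.txt
```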

In order to generate all action-event level recipes of a dataset split, run:

```
python generate_data_set_split.py --split [split_file] --type [split_type] --cont [context_len] --order [ordering_version] --config [configuration_file] --out [output_dir]
```

  • split_file: file with the assignments of the recipes to the different dataset splits
  • split_type: the name of the split to generate
  • context_len: number of previously generated sentences to include as input to the generation model (should not be larger than the context length the model was trained with)
  • ordering_version: optional; version of the traversal function to use; defaults to "pf-lf-id"
  • configuration_file: path to the .json file with the configurations for the generation
  • output_dir: optional; path to the directory where the generated texts will be saved; defaults to a folder "output" in the main project directory if not provided; for each recipe specified in split_file one output file is created (named [recipe_name]_generated.txt) with one instruction per line; if ordering_version is "all", the output file will contain one column per traversal (tab-separated) with a header line at the top
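
For instance, to generate all recipes of the test split (split file and configuration file paths are placeholders):

```
python generate_data_set_split.py --split data/splits.tsv --type test --cont 1 --config generation_config.json
```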

### ordering_version

Can be "top", "ids", "pf", "pf-lf" or "pf-lf-id" (see the wiki page for details on the different traversals), or can be set to "all" to generate one recipe text for each ordering.

### configuration_file

For more information about the configuration files for recipe generation, see the recipe-generation-model readme. For generating from an action graph, the configuration file only needs to include the "generator_args" parameter dict.
The specified "model_name_or_path" / "tokenizer_name_or_path" need to point to a directory with a trained T5-based AMR-to-text generation model, which needs to include all the files saved when running the huggingface methods for saving a model and a tokenizer.
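
A minimal configuration could look as follows; this is only a sketch, the model path is a placeholder, and the "generator_args" dict may need further entries depending on your setup (see the recipe-generation-model readme):

```json
{
  "generator_args": {
    "model_name_or_path": "models/t5_amr2text",
    "tokenizer_name_or_path": "models/t5_amr2text"
  }
}
```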

### split_file

Path to a .tsv file with the assignment of the recipes to the different splits, e.g.:

```
train    baked_ziti_7
train    garam_masala_8
val      waffles_9
test     cauliflower_mash_1
...
```

It is also possible to pass a file obtained by running create_recipe2split_assignment from the recipe-generation-model repository (see the Reproducible Split section there). The script takes care of removing the leading path and the "_gold.txt" suffix.
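
Conceptually, this normalization amounts to something like the following (an illustrative sketch, not the script's actual code):

```python
import os

def normalize_recipe_name(entry):
    """Strip a leading path and a trailing '_gold.txt' suffix from a split-file entry."""
    name = os.path.basename(entry)        # drop the leading path
    if name.endswith("_gold.txt"):
        name = name[:-len("_gold.txt")]   # drop the suffix
    return name

# e.g. normalize_recipe_name("some/path/baked_ziti_7_gold.txt") -> "baked_ziti_7"
```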

### split_type

Should be "train", "val" or "test" if the split file has the format shown above, but can be set to any value that occurs in the first column of the split file. All recipes whose first-column value equals split_type are then chosen for generation.
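
In other words, the selection boils down to a simple filter over the first column (an illustrative sketch, not the script's actual code):

```python
def select_recipes(split_file, split_type):
    """Return all recipe names whose first tab-separated column equals split_type."""
    selected = []
    with open(split_file, encoding="utf-8") as f:
        for line in f:
            columns = line.strip().split("\t")
            if len(columns) > 1 and columns[0] == split_type:
                selected.append(columns[1])
    return selected
```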

## Making Use of Coreference Information

The repository also contains scripts that use coreference information for making implicit arguments explicit, switching between explicit NP mentions and pronouns, creating a variation of the recipe corpus, and optionally including the information for the syntactic-dependency-based splitting. This code (located in the coref_processing folder) is in a preliminary state.

### Creating Joined Coref Files

Information about coreference clusters, the corresponding AMR nodes and coreferences arising from the AMR splitting can be obtained by running the coref_processing/create_joined_coref.py script.

This requires an additional subfolder of the `data` folder (as described above) containing one subfolder per dish with the .jsonlines coref files.

The paths to the action-level AMR graphs and to the coreference files are specified in `utils/paths.py` (ACTION_AMR_DIR and RAW_COREF_DIR). The path to the output folder, which gets created and will contain the generated files, is also specified in `paths.py` (JOINED_COREF_DIR).
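
Once the paths are set, the script can be run from the main project directory; assuming it reads all locations from `utils/paths.py`, no further arguments should be needed:

```
python coref_processing/create_joined_coref.py
```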

Details about the output format and information included can be found at the top of the script itself.