Pipeline Overview
Here, we are collecting an overview of the processing pipeline and data types across the whole cookbook project. This is just a sketch for now, but I'm uploading it already so people can see what I'm doing. I'll also add some examples and Wiki pages for the different types of recipe graphs (action graphs, etc.).
%%{init: { 'securityLevel':'loose', 'startOnLoad': 'true' } }%%
graph TD;
corpus[L'20 corpus] -- input --> tp[Tagger & Parser];
tp -- input --> alignment[Alignment Model];
tp -- action graphs --> crowd[Crowdsourcing];
alignment -- top k predictions --> crowd;
crowd -- training data --> alignment;
crowd -- ARA --> alignment;
yama[Y'20 corpus] -- training data --> tp;
alignment -.-> generation;
style generation stroke-dasharray: 5 5;
Repository: processing_microsoft_corpus
graph TD;
A[ ] -- 'common_crawl_text_recipes.json' --> process['process_raw_json_corpus.py'];
B[ ] -- 'recipe_pairwise_alignment' --> original['original_recipe_groupings.py'];
original -- 'original_grouping_of_recipes.json' --> process;
process -- dir structure with raw data --> j2t['recipe_json2tagger.py'];
C[ ] -- 'ara2urls_mappings.json' --> j2t;
j2t -.-> |tagger input files| tagger[Tagger];
style tagger stroke-dasharray: 5 5;
Steps
All recipes of the Microsoft corpus with raw text instructions split up into individual sentences.
Source: Download at
File format: json file with one "entry" for each dish which contains "entries" for all the recipes of that dish (at least one recipe per dish).
Example:
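A guess at the structure described above (all field names and values are invented for illustration; the real file may differ):

```
{
  "pancakes": [
    {"url": "http://example.com/recipe1", "text": ["Mix flour and milk.", "Fry until golden."]},
    {"url": "http://example.com/recipe2", "text": ["Whisk the eggs.", "Add sugar."]}
  ]
}
```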
Part of the Microsoft corpus; relevant for its file structure, i.e. for the information about train-dev-test splits and how recipes are grouped into dishes.
Source: download corpus as per these instructions; the relevant data is located at multimodal-aligned-recipe-corpus\recipe-pairwise-alignment
Example:
Read in the structural information in the recipe-pairwise-alignment directory.
Run: python original_recipe_groupings.py after adjusting CORPUS_PATH to point to recipe-pairwise-alignment.
Output: json file original_grouping_of_recipes.json with one entry for each recipe in the recipe-pairwise-alignment corpus that specifies to which dish and which data split the recipe belongs.
Example:
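A guess at what one entry might look like, based on the description above (field names and values invented for illustration):

```
"recipe_0001": {"dish": "pancakes", "split": "train"}
```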
Use structural information and raw corpus to create directory structure with train-dev-test splits and dishes.
Example prompt: python process_raw_json_corpus.py --raw common_crawl_text_recipes.json --grouping original_grouping_of_recipes.json --out raw-structured
Output: Creates directory raw-structured with one sub-directory per data split. Each data split has one sub-directory per dish containing exactly one file. Each line in this file is a json object corresponding to one recipe, with the same information as in common_crawl_text_recipes.json.
Example file structure:
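A guess at the resulting layout, based on the description above (dish and file names invented):

```
raw-structured/
    train/
        pancakes/
            pancakes.json
        ...
    dev/
        ...
    test/
        ...
```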
Example dish file:
Create files that can be used as tagger input. Either uses recipe IDs from ARA (the file ara2urls_mappings.json needs to be in the same directory as the script) or assigns random new integer IDs in order to distinguish between recipes of the same dish.
Example prompt: python recipe_json2tagger_input.py --source raw-structured --target tagger-input (optional: --ara True)
Output: Creates directory tagger-input with one file for each recipe (no sub-directories!). Each file has one 'sentence' json object per line.
Example file:
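A sketch of what such a file might look like (the sentence text is invented; the 'sentence' key is the one described above):

```
{"sentence": "Preheat the oven to 180 degrees."}
{"sentence": "Mix flour, sugar and butter in a bowl."}
```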
Repository: tagger-parser
graph TD;
A[ ] -- train.conll03 --> trained[tagger];
B[ ] -- bert-large-eng.jsonnet --> trained;
C[ ] -- train.conllu --> parser;
D[ ] -- parser.jsonnet --> parser;
trained -.-> |prediction pipeline| parser;
parser -.-> |predicted action graphs| am[Alignment Model];
parser -.-> |predicted action graphs| cr[Crowdsourcing];
style am stroke-dasharray: 5 5;
style cr stroke-dasharray: 5 5;
Steps
Set up the environment from requirements.txt or use the environment on the Coli servers at /proj/cookbook.shadow/conda/envs/allennlp2.8.
The recommended tagger (as well as additional trained taggers) trained with AllenNLP 2.8 can be found on the Coli servers at /proj/cookbook.shadow/Models_2022.05/bert-large-tagger.
To train a new tagger with AllenNLP 2.8, navigate to the dev branch of the repository.
Input:
- training data can be found at data/English/Tagger/train.conll03 or on the Coli servers at /proj/cookbook.shadow/Yamakata/Data/yamakata_train_240.conll03.
- configuration file; e.g. tagger/bert-large-eng.jsonnet.
Example prompt: allennlp train bert-large-eng.jsonnet -s trained-tagger after adapting the paths in the configuration file.
Output: creates a folder (e.g. trained-tagger) containing the trained model, training logs, etc.
The recommended parser trained with AllenNLP 2.8 can be found on the Coli servers at /proj/cookbook.shadow/Models_2022.05/parser.
To train a new parser with AllenNLP 2.8, navigate to the dev branch of the repository.
Input:
- training data can be found at data/English/Parser/train.conllu or on the Coli servers at /proj/cookbook.shadow/Yamakata/Data/yamakata_train_240.conllu.
- configuration file parser/parser.jsonnet.
Example prompt: allennlp train parser.jsonnet -s trained-parser after adapting the paths in the configuration file.
Output: creates a folder (e.g. trained-parser) containing the trained model, training logs, etc.
Repository: tagger-parser
For our paper, we used the AllenNLP 0.8 versions of the tagger and parser (main branch on GitHub); we are currently using the AllenNLP 2.8 versions (dev branch). For both, there are trained models ready to use on the Coli servers.
graph TD;
A[ ] --> |input.json| tagger[Trained Tagger];
tagger --> |tagged_recipe.json| 2conll[json_to_conll.py];
2conll --> |tagged_recipe.conllu| parser[Trained Parser];
parser --> |parsed_recipe.json| conll[json_to_conll.py];
conll --> |parsed_recipe.conllu| red[reduce_graph.py];
red -.-> |action_graph.conllu| alignment[Alignment Model];
red -.-> |action_graph.conllu| crowd[Crowdsourcing];
red -.-> |fat_graph.conllu| B[ ];
style alignment stroke-dasharray: 5 5;
style crowd stroke-dasharray: 5 5;
style B stroke-dasharray: 5 5;
Steps
You can find trained models here:
- AllenNLP 0.8 tagger (the one we used for the paper): /proj/cookbook.shadow/Models/yamakata_eng_elmo_300_0
- AllenNLP 2.8 tagger (most recent): /proj/cookbook.shadow/Models_2022.05/bert-large-tagger/model.tar.gz
- AllenNLP 0.8 parser (the one we used for the paper): /proj/cookbook.shadow/Models/yamakata_deps_300_0.tar.gz
- AllenNLP 2.8 parser (most recent): /proj/cookbook.shadow/Models_2022.05/parser/model.tar.gz
The tagger input is typically one recipe per file. Input files are either
a) a file with json 'sentence' objects containing the raw text of one recipe sentence.
Example:
or
b) a file in CoNLL-2003 format (which is also the output format of the tagger post-processing): a tsv file whose first column contains the tokens of the recipe text (one token per line). [empty line between sentences?] Note: the recipe text needs to be tokenized if you want to use this option.
Example:
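A sketch of such an input file (tokens invented; separating sentences with an empty line is an assumption based on standard CoNLL-2003 conventions):

```
Preheat
the
oven
.

Mix
the
flour
.
```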
Run conda: source /proj/cookbook.shadow/run_conda.sh
Activate conda environment: conda activate /proj/cookbook.shadow/conda/envs/allennlp2.8
Run tagger: allennlp predict /proj/cookbook.shadow/Models_2022.05/bert-large-tagger/model.tar.gz [input file as described above] --output-file tagged_recipe.json (use the --use-dataset-reader flag if the input is in CoNLL-2003 format)
Output: json file with one line per sentence. Tokens and predicted tags are stored under the respective keywords.
Example:
Convert tagger output into CoNLL-U format.
Example prompt: python data-scripts/json_to_conll.py -m tagger -p tagged_recipe.json
(see data-scripts/README.md for additional arguments)
CoNLL-U file (e.g. tagged_recipe.conllu) for one recipe: tab-separated values, first column ID, second column TOKEN, fourth column predicted TAG, seventh column (HEAD) with default value 0, eighth column (DEPREL) with default value root.
Example:
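A sketch of what such a file might look like (tokens and tag labels are invented for illustration; underscores mark unused columns):

```
1	Preheat	_	B-Ac	_	_	0	root	_	_
2	the	_	O	_	_	0	root	_	_
3	oven	_	B-T	_	_	0	root	_	_
4	.	_	O	_	_	0	root	_	_
```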
- data-scripts/error_analysis.py: creates files for manual error analysis
Input: CoNLL-U file (output of the tagger post-processing).
Run conda: source /proj/cookbook.shadow/run_conda.sh
Activate conda environment: conda activate /proj/cookbook.shadow/conda/envs/allennlp2.8
Run parser: allennlp predict /proj/cookbook.shadow/Models_2022.05/parser/model.tar.gz tagged_recipe.conllu --output-file parsed_recipe.json --use-dataset-reader
Output: json file with one line per sentence. Tokens and predicted tags are stored under the respective keywords.
Example:
Convert parser output into a recipe graph (CoNLL-U format).
Run: python data-scripts/json_to_conll.py -m parser -p parsed_recipe.json
(see data-scripts/README.md for additional arguments)
CoNLL-U file (e.g. parsed_recipe.conllu) containing a recipe graph: tab-separated values, first column ID, second column TOKEN, fourth column TAG, seventh column predicted HEAD, eighth column predicted RELATION type to head, additional (HEAD, RELATION) pairs in the ninth column if applicable.
Example:
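A sketch of one sentence of such a file (tokens, tags and relation labels are all invented for illustration):

```
1	Preheat	_	B-Ac	_	_	0	root	_	_
2	the	_	O	_	_	3	dep	_	_
3	oven	_	B-T	_	_	1	t	_	_
```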
- data-scripts/error_analysis.py: creates files for manual error analysis
- data-scripts/parser_evaluation.py: performs labelled evaluation on parser outputs
For some tasks, we need simpler graphs than the full recipe graphs generated by the parser. We reduce the set of relevant tags, thus deleting some nodes from the original graph (e.g. only action nodes for action graphs, only food, action and tool nodes for FAT graphs). There is an edge (directed but not labelled) between each pair of nodes where there was a path between these nodes in the original graph.
Run: python data-scripts/reduce_graph.py -m ['a' or 'fat'] -f parsed_recipe.conllu
Creates CoNLL-U file with the simplified graph at the default output path data/dishname/<recipe_name>.conllu. All edge labels are 'edge'.
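A minimal Python sketch of the reduction idea described above, assuming a simple (id -> tag, edge set) graph representation; this is an illustration only, not the actual reduce_graph.py (which may differ in details, e.g. by only connecting kept nodes whose path avoids other kept nodes):

```python
# Sketch of the reduction: keep only nodes with relevant tags and add an
# unlabelled directed edge between two kept nodes iff the original graph
# contained a path between them.

def reduce_graph(nodes, edges, keep_tags):
    """nodes: dict node_id -> tag; edges: set of (source, target) pairs."""
    adj = {}
    for src, tgt in edges:
        adj.setdefault(src, set()).add(tgt)

    def reachable(start):
        """All nodes reachable from start via directed edges."""
        seen, stack = set(), [start]
        while stack:
            for nxt in adj.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    kept = {n for n, tag in nodes.items() if tag in keep_tags}
    reduced_edges = {(u, v) for u in kept for v in reachable(u) & kept}
    return kept, reduced_edges

# Toy example: action graph mode ('a') keeps only action nodes;
# FAT mode would keep food, action and tool tags instead.
kept, edges = reduce_graph(
    nodes={1: "Ac", 2: "F", 3: "Ac"},   # invented toy recipe graph
    edges={(1, 2), (2, 3)},
    keep_tags={"Ac"},
)
print(kept, edges)  # {1, 3} {(1, 3)}
```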
- data-scripts/reduce_dir_to_action_graphs.py: calls data-scripts/reduce_graph.py for all files in a directory.
Repository: alignment-models
! Under construction - no details for now, changes ahead !
Steps
Input: Folder structure where the main directory, called recipes-for-training, contains one sub-directory for each dish; each dish directory contains
- a subdirectory recipes with action graphs in .conllu format
- a file alignments.tsv with gold alignments in the four columns file1, token1, file2, token2.
Activate or create conda environment (/proj/cookbook.shadow/conda/envs/alignment).
Run: python main.py Alignment-with-feature
! Not implemented yet !
Input:
- Folder structure where the main directory, called recipes-input-model, contains one sub-directory for each dish; each dish directory contains
  - a subdirectory recipes with action graphs in .conllu format
  - a dummy alignment file alignments.tsv to provide recipe pairing information (won't elaborate further as this is currently changing).
- Trained alignment model (soon to be found on the Coli servers).
- Empty target directory results1.
Activate or create conda environment (/proj/cookbook.shadow/conda/envs/alignment).
Run: something like python main.py Alignment-with-feature -k 1
Output: Probably one alignment file per dish, called predicted_alignments_<dishname>.tsv, with all the predicted alignments between recipe pairs of that dish.
Repository: crowdsourcing
For each action in a source recipe, use alignment model to predict top k (we've been using k=7) possible aligned actions in a target recipe. Create lists to display in crowdsourcing where the workers can choose between these k options and "None".
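A toy sketch of the option-list idea (function name and data invented for illustration; the actual conllu2crowd_topk.py does much more, e.g. formatting slides):

```python
# Toy sketch: for each action in the source recipe, keep the top-k
# predicted target actions and append "None" as a final option.
def build_options(ranked_predictions, k=7):
    """ranked_predictions: dict source_action -> ranked target actions."""
    return {source: candidates[:k] + ["None"]
            for source, candidates in ranked_predictions.items()}

options = build_options({"fry the onions": ["saute onions", "cook onions", "brown onions"]})
print(options)
# {'fry the onions': ['saute onions', 'cook onions', 'brown onions', 'None']}
```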
graph TD;
A[Alignment Model] -.-> |predictions.tsv| 2crowd[conllu2crowd_topk.py];
B[Parser] -.-> |CoNLL-U recipe graphs| 2crowd;
2crowd --> |Lists| lingo[LingoTurk];
C[ ] --> |round3-instructions| lingo;
lingo --> |results.csv| post[postprocess_crowdsourcing.py];
post --> |results.tsv| stats[extract_annotations_and_stats.py];
stats -.-> |alignments.tsv| A;
stats -.-> |plots and statistics| plot[ ];
post --> |results.tsv| r2[generate_non_majority_questionnaire.py];
2crowd --> |Lists| r2;
r2 --> |List| lingo;
style A stroke-dasharray: 5 5;
style plot stroke-dasharray: 5 5;
style B stroke-dasharray: 5 5;
Steps
Input: A dir structure, e.g. crowdsourcing-input, with one sub-directory per dish containing
- one file predictions.tsv per dish with the top k alignment predictions
- a recipes sub-directory with recipe graphs in individual CoNLL-U format files.
Example dir structure:
Example prediction file:
Pair up recipes from longest to shortest (within each dish) and create lists with formatted experiment slides. Actions in the target recipes are coloured and indexed. In the source recipes, actions are bolded and only the (up to two) experiment items per slide are coloured.
Example prompt: python topk-alignments/conllu2crowd_topk.py crowdsourcing-input --target-indexed
Output: creates directory Lists with all the lists in tsv format. A list has one slide per row and the following columns: recipe1, recipe2, question1, question2, options1, options2, indices1, indices2, q1_id, q2_id, documentid1, documentid2 (the document ID of recipe2), dish_name, slideid.
Example list:
Upload slides to LingoTurk and publish the experiment.
- Go to https://multitude.coli.uni-saarland.de:8080/ and log in (log-in data on the crowdsourcing Wiki).
- Click create experiment, then click CookBookPPZhai20201220.
- Copy the instructions (instructions/round2-instructions) and upload the lists. Click save in database.
- In view existing experiments, find the experiment and click publish.
Run experiment.
Results: Experiment results come in csv files with the following columns: filename, listnumber, assignmentid, hitid, workerid, origin, timestamp, partid, questionid, answer, recipe1, recipe2, question1, options1, question2, options2, indices1, indices2, q1, q2, documentid1, documentid2, dishname, id.
Column names might be swapped; use Post_Processing_maynotwork.ipynb to fix the data.
Input: Prolific result files.
Example prompt: python postprocess_crowdsourcing.py results.csv
Output: for each input csv file in input, creates a corresponding tsv file with the following columns: worker ID, question ID, question token ID, question doc ID, answer token ID, answer doc ID, numanswers.
Input: post-processed crowdsourcing results
Example prompt: python extract_annotations_and_stats.py results.tsv
Output: printed to command line and displayed in plots.
(Additional statistics might be computed with Post_Processing_maynotwork.ipynb and action_distributions.py.)
We haven't found a script for this, but the class AnswerExtractor implements a method data_to_majority(dataset) that may be used.
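A guess at what such a majority-vote step might look like (hypothetical sketch; the actual data_to_majority(dataset) signature and data layout may well differ):

```python
from collections import Counter

# Hypothetical sketch of a majority vote over crowd answers.
# dataset: list of (question_id, answer) pairs collected from all workers.
def data_to_majority(dataset):
    answers = {}
    for question_id, answer in dataset:
        answers.setdefault(question_id, []).append(answer)
    majority = {}
    for question_id, votes in answers.items():
        top_answer, count = Counter(votes).most_common(1)[0]
        # accept an answer only if more than half of the workers agree
        majority[question_id] = top_answer if count > len(votes) / 2 else None
    return majority

print(data_to_majority([("q1", "3"), ("q1", "3"), ("q1", "None"),
                        ("q2", "5"), ("q2", "7")]))
# {'q1': '3', 'q2': None}
```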
If there are too many questions without a majority vote, a follow-up round of experiments needs to be conducted.
Input: lists and post-processed result files.
Example prompt: python generate_non_majority_questionnaire.py results.tsv after adapting QUESTIONNAIRE_LOCATION in the script to point to the lists used for the experiment.
Output: a list file missing_majority_questionnaire.tsv suitable as experiment input.