Title: Extraction of Gateways in Process Model Generation from Text
Author: Janek Putz
Supervisor: Prof. Dr. Han van der Aa (Data and Web Science Group, University of Mannheim)
Date: Sep 2022 - Feb 2023
Keywords: Business Process Management, Process Discovery, Process Model Extraction, Gateway Extraction, Keyword Search, Sequence Flow Detection, NLP, Token Classification, Activity Relations, Relation Classification
This thesis investigates the extraction of gateways in the scope of extracting process models from unstructured documents. The repository contains code for implementation and evaluation of the following components:
- Traditional rule-based approach to extract gateways by keyword search and derive sequence flows
- Token classification model to filter false positive gateway extractions from the rule-based approach
- A model that supports the rule-based approach in detecting whether two gateway entities describe branches of the same gateway construct
- A novel approach that extracts gateways by analyzing the relations between activities
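The keyword search behind the first component can be pictured with a minimal sketch. It is not the repository's implementation: the keyword lists, the `GatewayCandidate` class and the function name `find_gateway_candidates` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists; the lists used in the repository may differ.
XOR_KEYWORDS = ["if", "otherwise", "either", "or"]
AND_KEYWORDS = ["meantime", "in the meantime", "at the same time"]


@dataclass(frozen=True)
class GatewayCandidate:
    kind: str     # "XOR" or "AND"
    keyword: str  # keyword that triggered the match
    start: int    # character offset where the match starts
    end: int      # character offset just after the match


def find_gateway_candidates(text):
    """Return keyword matches in text order, preferring the longest match."""
    candidates = []
    for kind, keywords in (("XOR", XOR_KEYWORDS), ("AND", AND_KEYWORDS)):
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                candidates.append(
                    GatewayCandidate(kind, keyword, match.start(), match.end())
                )
    # Sort by position, longer matches first, then drop matches that lie
    # inside an already accepted match ("meantime" in "in the meantime").
    candidates.sort(key=lambda c: (c.start, -(c.end - c.start)))
    accepted = []
    for cand in candidates:
        if any(cand.start >= a.start and cand.end <= a.end for a in accepted):
            continue
        accepted.append(cand)
    return accepted


sentence = (
    "If some files are missing, a search is initiated, otherwise the files "
    "can be physically tracked to the intended location."
)
xor_hits = find_gateway_candidates(sentence)
and_hits = find_gateway_candidates(
    "These are handed to the Associate, and in the meantime the list is distributed."
)
```

A pure keyword match also fires on words such as "or" in "one or two files", which is the kind of false positive the token classification model is meant to filter out.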
The work is based on a new dataset for process extraction from natural language texts called PET. It contains extensive annotations for various process elements.
- Huggingface: patriziobellan/PET
- Website and Paper
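To give an idea of what the token-level annotations look like, the sketch below groups BIO-tagged tokens into spans and picks out the gateway mentions. The label names (`XOR Gateway`, `Condition Specification`, `Activity Data`, `Activity`) are assumed to follow the PET token classification scheme, and the tokens and tags are a hand-written illustration, not a record copied from the dataset.

```python
def decode_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, start, end_exclusive, text) spans."""
    spans = []
    label, start = None, None
    # A trailing "O" acts as a sentinel so the last open span gets closed.
    for i, tag in enumerate(list(tags) + ["O"]):
        if label is not None and tag == "I-" + label:
            continue  # the current span continues
        if label is not None:
            spans.append((label, start, i, " ".join(tokens[start:i])))
            label, start = None, None
        if tag.startswith("B-"):
            label, start = tag[2:], i
        # A stray "I-" tag without an open span is treated like "O".
    return spans


# Hand-written example; label names are assumptions about the PET scheme.
tokens = ["If", "some", "files", "are", "missing", ",",
          "a", "search", "is", "initiated"]
tags = ["B-XOR Gateway",
        "B-Condition Specification", "I-Condition Specification",
        "I-Condition Specification", "I-Condition Specification", "O",
        "B-Activity Data", "I-Activity Data", "O", "B-Activity"]

spans = decode_spans(tokens, tags)
gateways = [span for span in spans if span[0].endswith("Gateway")]
```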
Gateway extraction and subsequent sequence flow derivation are demonstrated with the following example:
Input (doc-3.2, PET dataset):
Each morning, the files which have yet to be processed need to be checked, to make sure they are in order for the court hearing that day.
If some files are missing, a search is initiated, otherwise the files can be physically tracked to the intended location.
Once all the files are ready, these are handed to the Associate, and meantime the Judgeis Lawlist is distributed to the relevant people.
Afterwards, the directions hearings are conducted.
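How sequence flows can be derived once a gateway and its branches are known is sketched below for the second sentence of the input. This is a hand-labelled reading of the example under simplifying assumptions, not output produced by the repository; the node names and the function `gateway_sequence_flows` are hypothetical.

```python
def gateway_sequence_flows(previous, gateway, branches, following):
    """Connect a split/merge gateway pair with its branches.

    previous  -- node directly before the split
    gateway   -- identifier of the gateway construct
    branches  -- list of branches, each a list of activity names in order
    following -- node directly after the merge
    """
    split, merge = f"{gateway}-split", f"{gateway}-merge"
    flows = [(previous, split)]
    for branch in branches:
        if not branch:
            # An empty branch skips straight from split to merge.
            flows.append((split, merge))
            continue
        flows.append((split, branch[0]))
        flows.extend(zip(branch, branch[1:]))  # flows inside the branch
        flows.append((branch[-1], merge))
    flows.append((merge, following))
    return flows


# Hand-labelled elements for "If some files are missing, a search is
# initiated, otherwise the files can be physically tracked ...".
flows = gateway_sequence_flows(
    previous="check files",
    gateway="XOR-1",
    branches=[["initiate search"], ["track files"]],
    following="AND-1-split",
)
for source, target in flows:
    print(f"{source} -> {target}")
```

The same helper applies to the parallel construct signalled by "meantime" in the third sentence, since only the gateway type differs, not the shape of the flows.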
Detailed usage and parameterization options are documented in each class and function.
/token_approaches contains all code for the rule-based approach and its two extensions, token classification and same gateway classification. In order to ...
- evaluate the rule-based approach on all documents, test single documents or rerun the paper statistics, run RuleApproach.py
- train a false positive filter model, adjust the namespace arguments in GatewayTokenClassifier.py and execute
- use the false positive filter extension in the rule-based approach, set the path to the saved model in config.json, adjust the run section of RuleTokenFilteredApproach.py and run
- train a same gateway classification model, adjust the namespace arguments in SameGatewayClassifier.py and execute
- use the same gateway classification extension in the rule-based approach, set the path to the saved model in config.json, adjust the run section of RuleSGCApproach.py and execute
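Conceptually, the false positive filter sits between the keyword search and the sequence flow derivation. The sketch below shows that idea only: the candidate tuples, the scores and the names `filter_false_positives` and `score_fn` are made up for illustration, and a trained token classifier would take the place of the fixed score table.

```python
def filter_false_positives(candidates, score_fn, threshold=0.5):
    """Keep only candidates whose classifier score reaches the threshold."""
    return [cand for cand in candidates if score_fn(cand) >= threshold]


# Hand-made candidates: (token index, token, gateway type).
candidates = [(0, "If", "XOR"), (9, "otherwise", "XOR"), (14, "or", "XOR")]

# Stand-in for a trained model: fixed, made-up scores per token index.
made_up_scores = {0: 0.97, 9: 0.91, 14: 0.12}


def score_fn(candidate):
    return made_up_scores[candidate[0]]


kept = filter_false_positives(candidates, score_fn)
```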
The training of different architecture refinements for both models can be executed automatically via scripts in ../run_scripts/same_gateway_cls and ../run_scripts/token_cls that set the respective parameterizations and initiate the training.
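As a rough picture of what "setting parameterizations" amounts to, the sketch below expands a small grid of values into one `argparse.Namespace` per training run. The argument names (`epochs`, `batch_size`, `context_size`, `seed`) are hypothetical and the repository's scripts may be organized differently.

```python
import argparse
import itertools


def build_runs(base, grid):
    """Return one Namespace per combination of the values in grid."""
    keys = list(grid)
    runs = []
    for values in itertools.product(*(grid[key] for key in keys)):
        params = dict(vars(base))        # copy so the base stays untouched
        params.update(zip(keys, values))
        runs.append(argparse.Namespace(**params))
    return runs


# Hypothetical argument names and values, for illustration only.
base = argparse.Namespace(epochs=10, batch_size=8, context_size=1, seed=42)
grid = {"context_size": [0, 1, 2], "batch_size": [8, 16]}

runs = build_runs(base, grid)
for run in runs:
    print(vars(run))
```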
/relation_approaches contains all code for the extraction of gateways analyzing activity relations. In order to ...
- train an activity relation classification model, adjust the namespace arguments in RelationClassifier.py and execute
- evaluate an activity relation classification model, adjust the dummy namespace arguments in get_dummy_args in RelationClassificationBenchmark.py to the architecture values used during training and execute
- test gateway extraction on single documents, configure the main section in GatewayExtractor.py with a relation classifier instance and the desired document and execute
- evaluate the gateway extraction on all or desired test documents, configure the main section in GatewayExtractionBenchmark.py with a relation classifier instance and execute
The training of different architecture refinements for the relation classification model can be executed automatically via scripts in ../run_scripts that set the respective parameterizations and initiate the training.
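The idea of deriving gateways from activity relations can be sketched as follows. The relation labels (`directly_follow`, `exclusive`, `concurrent`), the hand-labelled pairs and the function `group_by_relation` are assumptions for illustration. Grouping by connected components is a simplification and is not claimed to match what GatewayExtractor.py does.

```python
from collections import defaultdict


def group_by_relation(activities, relations, relation):
    """Group activities that are linked, directly or transitively, by relation.

    activities -- activity names in document order
    relations  -- dict mapping (activity_a, activity_b) to a relation label
    relation   -- label to group by, e.g. "exclusive" or "concurrent"
    """
    neighbours = defaultdict(set)
    for (a, b), label in relations.items():
        if label == relation:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for activity in activities:
        if activity in seen or activity not in neighbours:
            continue
        # Depth-first search over the relation graph.
        stack, component = [activity], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        groups.append([a for a in activities if a in component])
    return groups


# Hand-labelled relations for the example document, for illustration only.
activities = ["check files", "initiate search", "track files",
              "hand over files", "distribute Lawlist", "conduct hearings"]
relations = {
    ("check files", "initiate search"): "directly_follow",
    ("initiate search", "track files"): "exclusive",
    ("hand over files", "distribute Lawlist"): "concurrent",
}

xor_branches = group_by_relation(activities, relations, "exclusive")
and_branches = group_by_relation(activities, relations, "concurrent")
```

Each resulting group corresponds to the branches of one candidate gateway: exclusive groups suggest an XOR gateway, concurrent groups an AND gateway.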
Additional runnable code:
- /important_analysis contains additional scripts that analyze the result data and generate the statistics used in the paper
- /notebooks contains various Jupyter notebooks used during development
More classes and scripts exist that are not intended for standalone execution.
- PetReader.py wraps the petdatasetreader module to read patriziobellan/PET with customized reading functions
- training.py and Ensemble.py provide structures used to train the models of both the token and the activity relation approaches
- utils.py collects various helper methods and functionality to read static information from files in /data
- matplotlib~=3.6.3
- numpy~=1.23.0
- openpyxl~=3.0.10
- pandas~=1.4.3
- petbenchmarks~=0.0.1a3
- petdatasetreader~=0.0.1b2
- scikit-learn~=1.1.3
- tensorflow~=2.8.0
- tensorflow-addons~=0.18.0
- transformers~=4.20.1