- Author: Juraj Dedič
- This directory was used to convert the human transcripts in the XML files into the dataset format used for training.
- Python packages: `bs4`

You can run the script in the following way (pass `True` as the optional second argument to enable augmentation):

`python3 convert.py <input_dir> [True]`
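
As an illustration of how the two positional arguments above might be read, here is a minimal sketch. The function name `parse_args` and the rule that only the literal string `True` switches augmentation on are assumptions, not taken from the actual script.

```python
def parse_args(argv):
    """Return (input_dir, augmentation_enabled) for a command line such as
    `convert.py <input_dir> [True]`. Hypothetical helper for illustration."""
    if len(argv) < 2:
        raise SystemExit("usage: python3 convert.py <input_dir> [True]")
    input_dir = argv[1]
    # Assumption: augmentation is enabled only by the literal string "True".
    augmentation_enabled = len(argv) > 2 and argv[2] == "True"
    return input_dir, augmentation_enabled


# Simulated command lines instead of sys.argv, so the sketch runs anywhere.
print(parse_args(["convert.py", "train"]))
print(parse_args(["convert.py", "train", "True"]))
```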
- An object-oriented design seemed like a reasonable way to solve this.
- The `converter.py` script reads the provided directory name and finds all XML files.
- It then proceeds to extract the entities and assign them to an `Utterance` object, which holds the `Entity` objects.
- This way, all the XML files are converted to `Utterance` objects.
- After that, the callsigns are extracted from the sentences (the first callsign in each).
- Augmentation can optionally be performed in this step (it did not increase the scores, so it was not used later).
- After that, the dataset is converted to an iterable format, which is then processed into a spaCy `DocBin`.
- The output `.spacy` file holding the dataset is written under the same name as the input directory.
- It also generates a `.list` file with the filenames of all the original files used.
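
The steps above, up to the iterable format and the `.list` file, could be sketched roughly as follows. This is not the actual implementation: it uses the standard-library `xml.etree.ElementTree` instead of `bs4` so that it runs without extra packages, and the XML layout (`<utterance>`, `<text>`, `<entity label start end>`), the label name `callsign`, and all method names are hypothetical. The `DocBin` step is left out here.

```python
import tempfile
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Entity:
    label: str
    start: int  # character offset where the entity begins
    end: int    # character offset one past the entity's last character


@dataclass
class Utterance:
    text: str
    entities: list = field(default_factory=list)

    def first_callsign(self):
        """Return the text of the first entity labelled 'callsign', or None."""
        for ent in self.entities:
            if ent.label == "callsign":
                return self.text[ent.start:ent.end]
        return None

    def to_sample(self):
        """Return a (text, annotations) pair with character-offset spans."""
        spans = [(e.start, e.end, e.label) for e in self.entities]
        return self.text, {"entities": spans}


def load_utterances(input_dir):
    """Parse every *.xml file in input_dir; return (utterances, filenames)."""
    utterances, filenames = [], []
    for path in sorted(Path(input_dir).glob("*.xml")):
        root = ET.parse(path).getroot()
        # Hypothetical layout: <utterance><text>..</text><entity .../></utterance>
        for node in root.iter("utterance"):
            utt = Utterance(text=node.findtext("text", default=""))
            for ent in node.iter("entity"):
                utt.entities.append(
                    Entity(ent.get("label"), int(ent.get("start")), int(ent.get("end")))
                )
            utterances.append(utt)
        filenames.append(path.name)
    return utterances, filenames


SAMPLE = """<utterance>
  <text>lufthansa one two three descend flight level eight zero</text>
  <entity label="callsign" start="0" end="23"/>
  <entity label="command" start="24" end="31"/>
</utterance>"""

with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp) / "train"
    input_dir.mkdir()
    (input_dir / "a.xml").write_text(SAMPLE, encoding="utf-8")

    utterances, filenames = load_utterances(input_dir)
    samples = [u.to_sample() for u in utterances]  # the iterable format

    # The .list file is named after the input directory, e.g. train.list.
    list_path = Path(tmp) / (input_dir.name + ".list")
    list_path.write_text("\n".join(filenames) + "\n", encoding="utf-8")
    list_contents = list_path.read_text(encoding="utf-8")

print(utterances[0].first_callsign())
print(samples[0])
```

Keeping the entities as character offsets on the `Utterance` makes the later conversion to spaCy spans straightforward, since spaCy can build spans from character offsets.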
- This file converts the `DocBin` `.spacy` file to JSON for use with PyTorch.
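
A conversion of this kind might look like the sketch below, assuming spaCy v3. The helper names `samples_to_docbin` and `docbin_to_records`, the label `callsign`, and the JSON layout (`text`, `tokens`, `entities`) are assumptions for illustration; the real script may use a different structure. The first helper only exists to produce a small `.spacy` file to convert.

```python
import json
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def samples_to_docbin(samples, nlp):
    """Build a DocBin from (text, {"entities": [(start, end, label), ...]}) pairs."""
    doc_bin = DocBin()
    for text, annotations in samples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # None: offsets do not align with token boundaries
                spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    return doc_bin


def docbin_to_records(spacy_path, nlp):
    """Read a .spacy file and return a list of JSON-serialisable dicts."""
    doc_bin = DocBin().from_disk(spacy_path)
    records = []
    for doc in doc_bin.get_docs(nlp.vocab):
        records.append({
            "text": doc.text,
            "tokens": [token.text for token in doc],
            "entities": [[ent.start_char, ent.end_char, ent.label_] for ent in doc.ents],
        })
    return records


nlp = spacy.blank("en")  # tokenizer only, no trained model required
samples = [("lufthansa one two three descend", {"entities": [(0, 23, "callsign")]})]

with tempfile.TemporaryDirectory() as tmp:
    spacy_path = Path(tmp) / "train.spacy"
    samples_to_docbin(samples, nlp).to_disk(spacy_path)
    records = docbin_to_records(spacy_path, nlp)
    json_text = json.dumps(records, indent=2)

print(json_text)
```

Plain lists and strings are used in the records so that a PyTorch `Dataset` can load the JSON without depending on spaCy.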
- This directory is an example of the most commonly used way of running the script.
- It was produced by simply running `python convert.py train`; the output was `train.spacy`.