This repo contains the wrapper script and Dockerfile for running PyMUSAS tagging on the CoNLL-U files of the ParlaMint project corpora translated to English.
Previous discussions with Tomaž and Matyáš about the output format are at clarin-eric/ParlaMint#204.
First, build the Docker image from the Dockerfile and other files in the DockerBuild
folder, naming it as follows:
docker build -t pymusas_conllu .
Then, to tag a corpus, do something like the following:
- Unzip the file
- cd into directory showing all the year folders
- run the pymusas_parlamint_wrapper.sh script
- monitor timestamps and output files until complete (speed is approx. 3 million words/hour on the UCREL VM)
- create a tar.gz of the output files
- copy the tarball to https://ucrel.lancs.ac.uk/paul/parlamint/PyMUSASTagged/ or some other wget-able location
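
The monitoring and packaging steps above can be roughly sketched with a few shell helpers. This is only an illustration, not part of the wrapper script: the function names, the `.conllu` file extension, and the directory arguments are assumptions, and the default rate is the approximate 3 million words/hour quoted above, so the estimate will differ on other machines.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of pymusas_parlamint_wrapper.sh).

# Count word lines in all CoNLL-U files under a directory.
# Assumes files end in .conllu; counts only lines whose first tab-separated
# column is a plain integer ID, so comment lines, blank lines,
# multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1) are skipped.
count_conllu_words() {
  find "$1" -type f -name '*.conllu' -exec cat {} + |
    awk -F'\t' '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Estimate tagging time in hours from a word count.
# The second argument is words per hour (default: 3000000, the approximate
# speed reported for the UCREL VM).
estimate_hours() {
  awk -v w="$1" -v rate="${2:-3000000}" 'BEGIN { printf "%.1f\n", w / rate }'
}

# Create a tar.gz of everything under an output directory.
# Write the tarball outside that directory so tar does not try to
# archive the file it is currently writing.
make_tarball() {
  tar -czf "$2" -C "$1" .
}
```

For example, from the directory showing all the year folders, `estimate_hours "$(count_conllu_words .)"` gives a rough idea of how long the run should take before starting it, and `make_tarball <output directory> <tarball path>` covers the packaging step once tagging is complete.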