
NMT: Usage

Matthew Beech edited this page Nov 11, 2024 · 23 revisions

Setting up and running an experiment

This section describes the tools most commonly used to set up and run an experiment.

experiment

The experiment tool runs the preprocess, train, and test tools in succession if none of the individual parts are specified.

```
usage: python -m silnlp.nmt.experiment [-h] [--stats] [--force-align] [--disable-mixed-precision]
       [--num-devices NUM_DEVICES] [--clearml-queue QUEUE] [--save-checkpoints]
       [--preprocess] [--train] [--test] [--translate] [--score-by-book] [--mt-dir DIR] [--debug]
       [--commit ID] [--scorers [scorer [scorer ...]]] [--multiple-translations]
       experiment
```

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| `--stats` | Compute tokenization statistics | Compute tokenization statistics. |
| `--force-align` | Force recalculation of all alignment scores | Only relevant when using the `--stats` option. |
| `--disable-mixed-precision` | Disable mixed precision | Only use this option if your GPU doesn't support mixed precision. Mixed precision is considerably faster than full precision and has lower memory requirements, allowing you to train larger models. It has a negligible effect on the final model. More... |
| `--num-devices NUM_DEVICES` | Number of devices to train on | To train a single model on multiple GPUs, use this option to set how many GPUs to use. Ensure that the environment variable CUDA_VISIBLE_DEVICES is also set so that multiple GPUs are visible, e.g., if using `--num-devices 2`, set CUDA_VISIBLE_DEVICES=0,1. |
| `--clearml-queue QUEUE` | ClearML queue | Run remotely on a ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run it locally and register it with ClearML. |
| `--save-checkpoints` | Save checkpoints to S3 bucket | Save checkpoints to the S3 bucket. |
| `--preprocess` | Run the preprocess step | Run the preprocess step. |
| `--train` | Run the train step | Run the train step. |
| `--test` | Run the test step | Run the test step. |
| `--translate` | Create drafts | See here for more details. |
| `--score-by-book` | Score individual books | In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. |
| `--mt-dir DIR` | The machine translation directory | Use an alternative machine translation directory for the location of the experiment. |
| `--debug` | Show debug information | Show information about the environment variables and arguments. |
| `--commit ID` | Commit ID | The silnlp git commit id with which to run a remote job. |
| `--scorers [scorer [scorer ...]]` | Set scorers | Specifies the list of scorers to be used on the predictions. Default is ['bleu', 'sentencebleu', 'chrf3', 'chrf3++', 'wer', 'ter', 'spbleu']. Additional options are 'chrf3+' and 'meteor'. |
| `--multiple-translations` | Produce multiple drafts | If the translate or test steps are being performed, produce multiple drafts of the input data or test data, respectively. When translating, the system will produce multiple output files, one for each draft. In testing, a new column is added to the output to specify the draft number (1, 2, etc.). See here for more details. |
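For example, a full run for a hypothetical experiment subfolder `MyProject/exp1` (the folder name is illustrative) might look like this:

```shell
# Run all three steps (preprocess, train, test) for the experiment,
# computing tokenization statistics during preprocessing.
python -m silnlp.nmt.experiment --stats MyProject/exp1

# Run only the test step, scoring each book in the test set individually.
python -m silnlp.nmt.experiment --test --score-by-book MyProject/exp1
```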

preprocess

The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:

  • splitting the source and target files into the training, validation, and test data sets;
  • writing the train/validate/test data sets to files in the experiment subfolder;
  • adapting the tokenizer of the parent model for use by this experiment;
  • generating tokenization statistics about the data.

```
usage: python -m silnlp.nmt.preprocess [-h] [--stats] [--force-align] experiment
```

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| `--stats` | Compute tokenization statistics | Compute tokenization statistics. |
| `--force-align` | Force recalculation of all alignment scores | Only relevant when using the `--stats` option. |
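A typical invocation for a hypothetical experiment subfolder might be:

```shell
# Prepare the data files for the experiment, computing tokenization
# statistics and recalculating all alignment scores from scratch.
python -m silnlp.nmt.preprocess --stats --force-align MyProject/exp1
```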

train

The train tool trains a neural model for one or more specified experiments. The experiment's configuration file (config.yml) and the data files created by the preprocess tool are used to control the training process.

```
usage: python -m silnlp.nmt.train [-h] [--disable-mixed-precision]
       [--num-devices NUM_DEVICES]
       experiments [experiments ...]
```

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiments` | Experiment names | The names of the experiments to train. Each experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| `--disable-mixed-precision` | Disable mixed precision | Only use this option if your GPU doesn't support mixed precision. Mixed precision is considerably faster than full precision and has lower memory requirements, allowing you to train larger models. It has a negligible effect on the final model. More... |
| `--num-devices NUM_DEVICES` | Number of devices to train on | To train a single model on multiple GPUs, use this option to set how many GPUs to use. Ensure that the environment variable CUDA_VISIBLE_DEVICES is also set so that multiple GPUs are visible, e.g., if using `--num-devices 2`, set CUDA_VISIBLE_DEVICES=0,1. |
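For instance, training a hypothetical experiment on two GPUs could be sketched as:

```shell
# Make two GPUs visible to the process, then train a single model on both.
export CUDA_VISIBLE_DEVICES=0,1
python -m silnlp.nmt.train --num-devices 2 MyProject/exp1
```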

test

The test tool tests the neural model for an experiment. If no trained model exists in the experiment folder, the base model will be used.

```
usage: python -m silnlp.nmt.test [-h] [--checkpoint CHECKPOINT]
       [--last] [--best] [--avg] [--ref-projects [project [project ...]]]
       [--force-infer] [--scorers [scorer [scorer ...]]]
       [--books BOOKS] [--by-book]
       experiment
```

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment to test. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| `--checkpoint CHECKPOINT` | Test specified checkpoint | Use the specified checkpoint (e.g., `--checkpoint 6000`) to generate target language predictions from the test set. The specified checkpoint must be available in the run subfolder of the specified experiment. |
| `--last` | Test the last checkpoint | Use the last training checkpoint to generate target language predictions. |
| `--best` | Test the best checkpoint | Use the best training checkpoint to generate target language predictions. The best checkpoint must be available in the run > export subfolder of the specified experiment. |
| `--avg` | Test the averaged checkpoint | Use the averaged training checkpoint to generate target language predictions. The averaged checkpoint must be available in the 'run > avg' subfolder of the specified experiment. An averaged checkpoint can be automatically generated during training using the `average_last_checkpoints: <n>` option in the train section, or it can be manually generated after training by using the average_checkpoints tool. |
| `--ref-projects [project [project ...]]` | Reference projects | The generated target language predictions are typically scored using the target language test set as the reference. If multiple reference projects were configured, this option can be used to specify which of these reference projects should be considered when scoring the predictions. |
| `--force-infer` | Force inferencing | If the test tool has already been used to generate and score predictions for an experiment's checkpoint, it will only score the predictions when it is run again on that same checkpoint. This option can be used to force the tool to re-generate the target language predictions. |
| `--scorers [scorer [scorer ...]]` | Set scorers | Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'sentencebleu', 'chrf3', 'chrf3+', 'chrf3++', 'meteor', 'ter', 'wer', and 'spbleu'. |
| `--books BOOKS` | Books to score | Specifies one or more books/chapters to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s)/chapter(s). Books must be specified using the 3-character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis) and follow the syntax found here. |
| `--by-book` | Score individual books | In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. If this option is used in combination with the `--books` option, individual scores are provided for each of the specified books. |
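As a sketch, testing the best checkpoint of a hypothetical experiment and scoring only Genesis might look like:

```shell
# Generate predictions from the best checkpoint, report the score for
# Genesis only, and also report a score for each book individually.
python -m silnlp.nmt.test --best --books GEN --by-book MyProject/exp1
```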

translate

The translate tool uses a trained neural model to translate text to a new language. Three translation scenarios are supported, with differing command line arguments for each scenario. The supported scenarios are:

  1. Using a trained model to translate the text in a file from the source language to a target language.
  2. Using a trained model to translate the text in a sequence of files into a target language.
  3. Using a trained model to translate a USFM-formatted book in a Paratext project into a target language.

The command line arguments for each of these scenarios are described below.

```
usage: python -m silnlp.nmt.translate [-h] [--checkpoint CHECKPOINT]
       [--src SRC] [--trg TRG]
       [--src-prefix SRC_PREFIX] [--trg-prefix TRG_PREFIX] [--start-seq START_SEQ] [--end-seq END_SEQ]
       [--src-project SRC_PROJECT] [--trg-project TRG_PROJECT]
       [--books BOOKS] [--src-iso LANG] [--trg-iso LANG]
       [--include-inline-elements] [--stylesheet-field-update ACTION] [--multiple-translations]
       [--clearml-queue QUEUE] [--debug] [--commit ID]
       experiment
```

Text file

Using the combination of command line arguments described in this section, the translate command will translate the sentences in a text file from the source language to the target language, using the requested checkpoint from a trained model.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario). |
| `--checkpoint CHECKPOINT` | Use specified checkpoint | Use the specified checkpoint to generate the translations. A particular checkpoint number can be specified (e.g., `--checkpoint 6000`), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
| `--src SRC` | Source file | Name of a text file with the source language sentences to be translated (one sentence per line). The translate tool looks for the file in the current working directory or, if a full/relative path is specified, it looks for the file in the specified folder. Each line in the specified source file is translated and written to the specified target file. |
| `--trg TRG` | Target file | Name of the text file where the translated sentences will be written (one per line). |
| `--src-iso LANG` | Source language ISO code | The ISO code for the source language. |
| `--trg-iso LANG` | Target language ISO code | The ISO code for the target language. |
| `--multiple-translations` | Produce multiple drafts | Produce a number of drafts equal to num_drafts in config.yml. The way that source and target files are specified does not need to be changed when using this. Instead, a suffix will be added to the output file, corresponding to the draft number. For example, if you specified `--trg output.txt`, files named output.1.txt, output.2.txt, etc. will be created. See here for more details. |
| `--clearml-queue QUEUE` | ClearML queue | Run remotely on a ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run it locally and register it with ClearML. |
| `--debug` | Show debug information | Show information about the environment variables and arguments. |
| `--commit ID` | Commit ID | The silnlp git commit id with which to run a remote job. |
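A minimal sketch for this scenario, assuming a hypothetical experiment and file names:

```shell
# Translate the sentences in src.txt (one per line) into trg.txt
# using the best checkpoint of the experiment's model.
python -m silnlp.nmt.translate --checkpoint best --src src.txt --trg trg.txt MyProject/exp1
```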

Sequence of Text Files

Using the combination of command line arguments described in this section, the translate command will translate sentences from a sequence of source language text files. The sentences in these source language text files are translated to the target language using the requested checkpoint from a trained model, and written to a corresponding sequence of target language text files.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario). |
| `--checkpoint CHECKPOINT` | Use specified checkpoint | Use the specified checkpoint to generate the translations. A particular checkpoint number can be specified (e.g., `--checkpoint 6000`), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
| `--src-prefix SRC_PREFIX` | Source file prefix (e.g., de-news2019-) | The file name prefix for the source files. The translate tool looks for the sequence of source files in the current working directory. |
| `--trg-prefix TRG_PREFIX` | Target file prefix (e.g., en-news2019-) | The file name prefix for the target files. The translate tool will write the translated text to a series of files with this specified file name prefix; the translated files will be written to the current working directory. |
| `--start-seq START_SEQ` | Starting file sequence number | The first source language file to translate (e.g., `--start-seq 0`). The source files must use a 4-digit, zero-padded numbering sequence ('en-news2019-0000.txt', 'en-news2019-0001.txt', etc.). |
| `--end-seq END_SEQ` | Ending file sequence number | The final source language file sequence number to translate. |
| `--src-iso LANG` | Source language ISO code | The ISO code for the source language. |
| `--trg-iso LANG` | Target language ISO code | The ISO code for the target language. |
| `--multiple-translations` | Produce multiple drafts | Produce a number of drafts equal to num_drafts in config.yml. The way that source and target files are specified does not need to be changed when using this. Instead, a suffix will be added to the output file, corresponding to the draft number. For example, if you specified `--trg-prefix output_` and `--end-seq 2`, files named output_0000.1.txt, output_0000.2.txt, output_0001.1.txt, etc. will be created. See here for more details. |
| `--clearml-queue QUEUE` | ClearML queue | Run remotely on a ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run it locally and register it with ClearML. |
| `--debug` | Show debug information | Show information about the environment variables and arguments. |
| `--commit ID` | Commit ID | The silnlp git commit id with which to run a remote job. |
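For example, translating a hypothetical sequence of three numbered files might be sketched as:

```shell
# Translate de-news2019-0000.txt through de-news2019-0002.txt from the
# current working directory, writing en-news2019-0000.txt, etc.
python -m silnlp.nmt.translate --checkpoint best \
    --src-prefix de-news2019- --trg-prefix en-news2019- \
    --start-seq 0 --end-seq 2 MyProject/exp1
```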

Paratext book (USFM file)

Using the combination of command line arguments described in this section, the translate command will translate a book from a Paratext project into the requested target language. The translated text is written into a USFM-formatted file with markup that closely follows the markup in the source book.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment folder with the model to be used for translating the source project. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| `--checkpoint CHECKPOINT` | Use specified checkpoint | Use the specified checkpoint to generate the translations. A particular checkpoint number can be specified (e.g., `--checkpoint 6000`), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
| `--src-project SRC_PROJECT` | The source project to translate | The name of the source Paratext project. The project name must correspond to a subfolder in the SIL_NLP_DATA_PATH > Paratext > projects folder. |
| `--trg-project TRG_PROJECT` | Target project | The name of the target Paratext project that will fill in missing text for books that are not entirely translated. The project name must correspond to a subfolder in the SIL_NLP_DATA_PATH > Paratext > projects folder. |
| `--books BOOKS` | The books to translate | A list of the books/chapters in the source Paratext project to be translated. Book identifiers should follow the USFM 3.0 standard, and the selections should follow the syntax found here. If multiple selections are being made, put the selections in quotes so that the semicolons are not misinterpreted. |
| `--trg-iso LANG` | Target language ISO code | The ISO code for the target language. |
| `--include-inline-elements` | Keep inline elements in USFM files | Keeps inline USFM elements such as footnotes and cross references. Default behavior is to remove these elements before translating. |
| `--stylesheet-field-update ACTION` | Handle USFM style conflicts | What to do with the OccursUnder and TextProperties fields of a project's custom stylesheet. Possible values are 'replace', 'merge' (default), and 'ignore'. |
| `--multiple-translations` | Produce multiple drafts | Produce a number of drafts equal to num_drafts in config.yml. The way that source and target files are specified does not need to be changed when using this. Instead, a suffix will be added to the output file, corresponding to the draft number. For example, if you specified `--books JOL`, then in the target project's run directory, files named 29JOL.1.SFM, 29JOL.2.SFM, etc. will be created. See here for more details. |
| `--clearml-queue QUEUE` | ClearML queue | Run remotely on a ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run it locally and register it with ClearML. |
| `--debug` | Show debug information | Show information about the environment variables and arguments. |
| `--commit ID` | Commit ID | The silnlp git commit id with which to run a remote job. |
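As a sketch, translating the book of Joel from a hypothetical source project (the project and experiment names are placeholders):

```shell
# Translate Joel from the source Paratext project into the target
# language, producing a USFM-formatted output file.
python -m silnlp.nmt.translate --checkpoint best --src-project MY_SRC_PROJECT --books JOL MyProject/exp1
```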

Assessing Data Suitability for Training a Model

analyze_project_pairs

Gets verse counts and computes alignment scores for pairs of biblical texts. Outputs the raw counts/scores and optionally summarizes the information in Excel files.

Configuration information: The script functions the same way as an experiment in that it operates within an experiment folder and uses a reduced version of an experiment's config.yml file. It only expects the "data" section of the config file to exist*. Within the data section, it only looks at the "aligner" and "corpus_pairs" fields. Within each corpus pair, it uses the "src", "trg", "mapping", "corpus_books", and "score_threshold" fields. See here for definitions and default values for each field.

*It will also optionally look at the "model" field to check if the model was trained on any data with the same script as the given data.
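A reduced config.yml for this script might look like the following sketch; the project names and aligner value are illustrative placeholders, and only the fields listed above are read:

```yaml
data:
  aligner: eflomal        # aligner to use for computing alignment scores
  corpus_pairs:
    - src: my-src-project # source text (placeholder name)
      trg: my-trg-project # target text (placeholder name)
```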

```
usage: python -m silnlp.nmt.analyze_project_pairs [-h] [--create-summaries] [--recalculate]
       [--deutero] [--clearml-queue QUEUE] experiment
```

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment folder | The name of the subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder containing the config.yml file and where outputs will be written. |
| `--create-summaries` | Create summary Excel files | Creates two files: one more general file containing verse counts and high-level alignment stats, and another with a more in-depth breakdown of the alignment scores. |
| `--recalculate` | Force recalculation of all verse counts and alignment scores | Verse counts are cached globally, but alignments will always be created from scratch the first time a given experiment is run and will be stored in the experiment folder. |
| `--deutero` | Include books from the Deuterocanon | A warning message will be printed for each text that has books from the Deuterocanon when this option is not used. |
| `--clearml-queue QUEUE` | ClearML queue | Run remotely on a ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run it locally and register it with ClearML. analyze_project_pairs is a CPU-intensive script that will not benefit from (and in fact will probably be slowed down by) a GPU-only queue. |
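A typical invocation for a hypothetical experiment folder might be:

```shell
# Analyze the corpus pairs defined in the experiment's config.yml,
# including Deuterocanonical books, and write summary Excel files.
python -m silnlp.nmt.analyze_project_pairs --create-summaries --deutero MyProject/pair_analysis
```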

Analyzing experiment metadata

alphabet_similarity

Calculates alphabet similarity between text corpora in a multilingual data set.

```
usage: python -m silnlp.nmt.alphabet_similarity [-h] experiment
```

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |

segment_length

Display a histogram of segment lengths in tokens.

```
usage: python -m silnlp.nmt.segment_length [-h] experiment filename
```

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| `filename` | Tokenized file in experiment folder | Tokenized file in the experiment folder. |

vocab_overlap

Calculate the vocab overlap between two experiments.

```
usage: python -m silnlp.nmt.vocab_overlap [-h] exp1 exp2
```

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `exp1` | Experiment 1 name | The name of the first experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| `exp2` | Experiment 2 name | The name of the second experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |

Analyzing the results of an experiment

check_train_val_test_split

After a model has been trained and used to generate predictions for the test set, the check_train_val_test_split tool can be used to analyze the word distributions across the train, validate, and test sets for the source and target corpora. By default, the tool will generate high-level statistics regarding the occurrence of "unknown" words (i.e., words that occur in the validation set or in the test set, but not in the training set). The tool can also be used to generate detailed listings of these unknown words and their occurrence counts. It is also possible to have the tool compare these unknown words to the valid words found in the training set to identify possible misspellings. Output is saved in the word_count.xlsx file in the specified experiment folder.

```
usage: python -m silnlp.nmt.check_train_val_test_split [-h]
       [--details] [--similar-words]
       [--distance DIST] [--detok-val]
       experiment
```

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `experiment` | Experiment name | The name of the experiment to check. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| `--details` | Show detailed word lists | Generate detailed lists of validation set and test set words that are not found in the training set. Separate lists are generated for the source and target corpora. Occurrence counts are provided for each identified word. |
| `--similar-words` | Find similar words | Compare each unknown word to the valid words found in the training set and identify possible misspellings in the validation and test sets. Levenshtein distance is used to identify the possible misspellings. |
| `--distance DIST` | Maximum Levenshtein distance for word similarity | By default, a Levenshtein distance of 1 is used to identify similar words in the training set. This parameter can be used to specify a different distance. |
| `--detok-val` | Detokenize the target validation set | Detokenize the target validation set. |
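For example, a detailed check of a hypothetical experiment could be sketched as:

```shell
# List unknown words in detail and flag possible misspellings within a
# Levenshtein distance of 2; output is written to word_count.xlsx.
python -m silnlp.nmt.check_train_val_test_split --details --similar-words --distance 2 MyProject/exp1
```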

diff_predictions

The diff_predictions tool can be used to compare the test set predictions to the reference sentences for an experiment. The tool generates a spreadsheet (diff_predictions.xlsx) with multiple comparison tabs. The comparison includes the test set source text, the target language reference text, the predictions, and the sentence-level BLEU scores for the predictions. Optionally, the tool can mark-up each prediction to identify the differences between the reference text and the prediction. The source text can also be marked up to highlight test set words that are not found in the training set. Optionally, the training set source / target sentence pairs can be included in the output spreadsheet on a separate tab.

```
usage: python -m silnlp.nmt.diff_predictions [-h] [--last]
       [--show-diffs] [--show-unknown] [--show-dict]
       [--include-train] [--include-dict] [--analyze-digits]
       [--preserve-case] [--tokenize TOK] [--scorers [scorer [scorer ...]]]
       exp1
```

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| `exp1` | Experiment name | The name of the experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| `--last` | Use last result | Use the last result instead of the best one. |
| `--show-diffs` | Show differences (predictions vs reference) | Mark up the predictions to indicate where they differ from the reference text. |
| `--show-unknown` | Show unknown words in source verse | Mark up the test set source sentences to indicate words that do not occur in the training set. |
| `--show-dict` | Show dictionary words in source verse | Show dictionary words in the source verse. |
| `--include-train` | Include the src/trg training corpora in the spreadsheet | Include the parallel source/target training sentence pairs in another tab in the spreadsheet. |
| `--include-dict` | Include the src/trg dictionary in the spreadsheet | Include the src/trg dictionary in the spreadsheet. |
| `--analyze-digits` | Perform digits analysis | Perform digits analysis. |
| `--preserve-case` | Score predictions with case preserved | Preserve case when calculating the sentence-level BLEU score for the source/target sentence pairs. By default, the tool will lowercase the source and target. Note that this behavior is secondary to the source/target case settings specified in the config.yml file; if those settings specified lowercasing, then this argument has no effect. |
| `--tokenize TOK` | Sacrebleu tokenizer (none,13a,intl,zh,ja-mecab,char) | Specifies the Sacrebleu tokenizer that will be used to calculate the sentence-level BLEU score for each source/target sentence pair. (Default: 13a) |
| `--scorers [scorer [scorer ...]]` | List of scorers | Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'sentencebleu', 'chrf3', 'chrf3+', 'chrf3++', 'meteor', 'ter', 'wer', and 'spbleu'. |
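For example, a marked-up comparison for a hypothetical experiment could be generated with:

```shell
# Compare predictions against the reference, marking up differences and
# unknown source words; the result is saved as diff_predictions.xlsx.
python -m silnlp.nmt.diff_predictions --show-diffs --show-unknown --include-train MyProject/exp1
```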