The training data was generated by running scripts/01-generate-synthetic-training-data.py and scripts/02-split-generated-data.py on a list of common English words, available here.
To generate your own dataset, create a training file and a validation file. Both use the same format, one sample per line with three tab-separated fields:
<CHARACTER SEQUENCE><TAB><TYPE><TAB><SUBTYPE>
Example
ngnix STRING PROGRAM
Y29tbWl4dHVyZQ== HASH PASSWORD
b3d2cf2ec3894374b37d1b79edd57ad4 HASH API_KEY
9c795829-75bc-4596-87d3-3508372bbf5f HASH API_KEY
licenser STRING WORD
NOTE: There are no predefined values for type and subtype.
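The format above can be produced and checked with a few lines of Python. This is a minimal sketch, not the project's own tooling: the helper names `write_samples` and `read_samples` are hypothetical, and the check that each line has exactly three fields is an assumption drawn from the format description rather than from the training scripts.

```python
import io


def write_samples(samples, handle):
    """Write (sequence, type, subtype) tuples as tab-separated lines."""
    for sequence, type_label, subtype_label in samples:
        fields = (sequence, type_label, subtype_label)
        # A tab or newline inside a field would corrupt the line layout.
        if any("\t" in field or "\n" in field for field in fields):
            raise ValueError(f"field contains a tab or newline: {fields!r}")
        handle.write("\t".join(fields) + "\n")


def read_samples(handle):
    """Parse tab-separated lines back into (sequence, type, subtype) tuples."""
    samples = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-separated fields, "
                f"got {len(fields)}"
            )
        samples.append(tuple(fields))
    return samples


if __name__ == "__main__":
    # Labels are free-form; these are taken from the example above.
    demo = [
        ("licenser", "STRING", "WORD"),
        ("b3d2cf2ec3894374b37d1b79edd57ad4", "HASH", "API_KEY"),
    ]
    buffer = io.StringIO()
    write_samples(demo, buffer)
    buffer.seek(0)
    print(read_samples(buffer))
```

To write real files, pass a handle opened with `open(path, "w", encoding="utf-8")` once for the training file and once for the validation file. The file names are up to you.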