Standard training data

The standard training data was generated by running scripts/01-generate-synthetic-training-data.py and then scripts/02-split-generated-data.py on a list of common English words, available here.

Generating your own training data

To generate your own dataset, create a training file and a validation file. Both use the same simple format, with one example per line and the three fields separated by tabs:

<CHARACTER SEQUENCE><TAB><TYPE><TAB><SUBTYPE>

Example

ngnix	STRING	PROGRAM
Y29tbWl4dHVyZQ==	HASH	PASSWORD
b3d2cf2ec3894374b37d1b79edd57ad4	HASH	API_KEY
9c795829-75bc-4596-87d3-3508372bbf5f	HASH	API_KEY
licenser	STRING	WORD

NOTE: There are no predefined values for TYPE and SUBTYPE, so you can use whatever labels fit your data.
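
As a rough illustration of the format above, the following sketch writes a training file and a validation file and reads them back to check that every line has exactly three tab-separated fields. It is not the repository's own tooling: the function names, the file names train.tsv and validation.tsv, the 80/20 split, and the fixed seed are all assumptions made for this example, and scripts/02-split-generated-data.py may split the data differently.

```python
import random

# Example rows taken from the README; the labels are arbitrary,
# since TYPE and SUBTYPE have no predefined values.
EXAMPLES = [
    ("ngnix", "STRING", "PROGRAM"),
    ("Y29tbWl4dHVyZQ==", "HASH", "PASSWORD"),
    ("b3d2cf2ec3894374b37d1b79edd57ad4", "HASH", "API_KEY"),
    ("9c795829-75bc-4596-87d3-3508372bbf5f", "HASH", "API_KEY"),
    ("licenser", "STRING", "WORD"),
]


def split_examples(rows, validation_fraction=0.2, seed=0):
    """Shuffle rows and return (train, validation). The ratio is an assumption."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_examples(path, rows):
    """Write rows as <CHARACTER SEQUENCE><TAB><TYPE><TAB><SUBTYPE>, one per line."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for row in rows:
            # A tab or newline inside a field would corrupt the format.
            if any("\t" in field or "\n" in field for field in row):
                raise ValueError(f"field contains a tab or newline: {row!r}")
            f.write("\t".join(row) + "\n")


def read_examples(path):
    """Read a file back and verify that each line has exactly three fields."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                raise ValueError(f"{path}:{line_no}: expected 3 fields, got {len(fields)}")
            rows.append(tuple(fields))
    return rows


if __name__ == "__main__":
    train, validation = split_examples(EXAMPLES)
    # Hypothetical file names; use whatever paths your training setup expects.
    write_examples("train.tsv", train)
    write_examples("validation.tsv", validation)
    print(len(read_examples("train.tsv")), "training rows")
    print(len(read_examples("validation.tsv")), "validation rows")
```

Reading the files back after writing them is a cheap way to catch stray tabs or missing fields before training starts.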