Language Models for Dataset Generation
This release includes language models that generate text which can then be translated by a rule-based text-to-sign translator.
The tlm_14.0.pt model (sign_language_translator.models.TransformerLanguageModel) is a custom transformer trained on ~800 MB of text composed only of the words for which PakistanSignLanguage signs are available (see sign_recordings/collection_to_label_to_language_to_words.json). The tokenizer used is sign_language_translator.languages.text.urdu.Urdu().tokenizer; additionally, the digits in numbers and the letters in acronyms are split apart as individual tokens to limit the vocabulary size. A later update will generate disambiguated words. The start & end tokens are "<" & ">".
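Text can be sampled from this model one token at a time. The following is a minimal sketch, not a definitive usage example: the local file path is a placeholder, and the load() / next() interface reflects one plausible form of this package's language-model API, which may differ across versions:

```python
from sign_language_translator.models import TransformerLanguageModel

# load the pretrained weights (path is a placeholder; point it to
# wherever tlm_14.0.pt is stored)
model = TransformerLanguageModel.load("tlm_14.0.pt")

# sample token by token, starting from the start-of-sequence token "<"
tokens = ["<"]
for _ in range(50):
    next_token, probability = model.next(tokens)  # assumed (token, prob) return
    tokens.append(next_token)
    if next_token == ">":  # stop at the end-of-sequence token
        break

print(" ".join(tokens))
```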
The -mixed-.pkl model is trained on unambiguous, supported Urdu words from a corpus of around 10 MB (2.4 million tokens). It is a mixture of 6 n-gram models with context window sizes from 1 to 6. It cannot model longer-range dependencies, so concept drift can be observed in longer generations. The tokenizer used is slt.languages.text.urdu.Urdu().tokenizer. The start & end tokens are "<" & ">".
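A comparable mixture can be assembled from scratch roughly as follows. This is a sketch under assumptions: the corpus sentences are hypothetical stand-ins for the real training data, and the NgramLanguageModel / MixerLM constructor arguments are one plausible form of this package's API:

```python
from sign_language_translator.models import MixerLM, NgramLanguageModel

# hypothetical word-level corpus: each sentence is a token list wrapped
# in the start token "<" and end token ">"
corpus = [
    ["<", "میں", "سکول", "جاتا", "ہوں", ">"],
    ["<", "وہ", "کتاب", "پڑھتی", "ہے", ">"],
]

# six n-gram models with context windows of 1 to 6 tokens
models = [NgramLanguageModel(window_size=n, unknown_token="") for n in range(1, 7)]
for ngram_model in models:
    ngram_model.fit(corpus)

# mix them into a single model; constructor details beyond `models` are
# assumptions and may differ in the installed version
mixed_model = MixerLM(models=models)

next_word, probability = mixed_model.next(["<", "میں"])
```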
The *.json models demonstrate the functionality of n-gram models. The training data is text_preprocessing.json:person_names.
- Contains n-gram-based statistical language models trained on 366 Urdu and 366 English names commonly used in Pakistan.
- The models predict the next character based on the previous 1-3 characters (see the sketch after this list).
- The start and end of sequence tokens are "[" and "]".
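A minimal sketch of training and sampling such a character-level model follows. The names shown are hypothetical stand-ins for the person_names data, and the fit() / next() interface is an assumption about this package's API:

```python
from sign_language_translator.models import NgramLanguageModel

# hypothetical training data in the person_names format: each name is
# wrapped in the start token "[" and end token "]"
names = ["[areeba]", "[bilal]", "[daniyal]", "[fatima]", "[hamza]"]

# character-level model that conditions on the previous 2 characters
model = NgramLanguageModel(window_size=2, unknown_token="")
model.fit(names)

# generate a new name character by character
name = "["
for _ in range(20):
    next_char, probability = model.next(name)  # assumed (token, prob) return
    name += next_char
    if next_char in ("]", ""):  # stop at end token or unknown token
        break

print(name)  # e.g. "[hamila]" (output varies with the random sampling)
```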