
Vocabulary size too high


If you see an error like this during preprocessing:

RuntimeError: Internal: src/trainer_interface.cc(590) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (24000). Please set it to a value <= 12638.

This indicates that SentencePiece was not able to create a vocabulary of the requested size. The error message does not say whether it is the source or the target vocabulary that is set too high, so it can help to set the two sizes to different values initially so that you can tell which one needs to be reduced; you can also check each side directly with SentencePiece, as in the sketch below.

This issue often arises when working with the small corpus of a New Testament or Bible in a low-resource language. When training a parent model with millions of sentences, much larger vocabulary sizes are possible. For many of our experiments we have found that the largest vocabulary size the corpus allows gives a good result.
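Here is a minimal sketch of such a check using the sentencepiece Python package directly. The file names and vocabulary sizes are placeholders; substitute the extracted source and target text files and the sizes from your config. The preprocessing step may apply its own normalization before training, so treat this only as a rough way to see which side fails and what maximum SentencePiece reports for it.

```python
import sentencepiece as spm

# Placeholder paths and sizes; replace with your own source/target text files
# and the src_vocab_size / trg_vocab_size values from your config.
checks = [
    ("source", "src-text.txt", 24000),
    ("target", "trg-text.txt", 32000),
]

for side, text_file, vocab_size in checks:
    try:
        spm.SentencePieceTrainer.train(
            input=text_file,
            model_prefix=f"{side}_vocab_check",
            vocab_size=vocab_size,
        )
        print(f"{side}: vocab_size={vocab_size} is fine")
    except RuntimeError as err:
        # SentencePiece reports the largest vocabulary it can build for this corpus.
        print(f"{side}: {err}")
```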

If this config file:

data:
  corpus_pairs:
  - type: train,val,test
    src: src-text
    trg: trg-text
  share_vocab: false
  src_vocab_size: 24000
  trg_vocab_size: 32000

causes the error above during preprocessing, then editing it like this:

data:
  corpus_pairs:
  - type: train,val,test
    src: src-text
    trg: trg-text
  share_vocab: false
  src_vocab_size: 12638
  trg_vocab_size: 32000

should solve the problem.