YouTokenToMe - high-performance unsupervised text tokenization - for Ruby
Learn more about how it works
Add this line to your application’s Gemfile:
gem "youtokentome"
Dump your text to a file (train.txt in the examples that follow), for example:
Blazingly fast tokenization!
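If the text lives in Ruby rather than on disk, a small helper can write it out first. This is a minimal sketch using only the standard library: `dump_corpus` is a hypothetical name, not part of the gem, and the one-sentence-per-line layout is an assumption about how you want the training file organized.

```ruby
# Hypothetical helper (not part of the youtokentome gem).
# Writes one sentence per line and returns the number of lines written.
def dump_corpus(sentences, path)
  # Collapse every run of whitespace (including newlines) to a single space
  # so each sentence stays on one line, then drop sentences that end up empty.
  lines = sentences.map { |s| s.gsub(/\s+/, " ").strip }.reject(&:empty?)
  File.write(path, lines.join("\n") + "\n")
  lines.size
end
```

Calling `dump_corpus(["Blazingly fast tokenization!"], "train.txt")` would produce the file used in the training step.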
Train a model
model = YouTokenToMe::BPE.train(data: "train.txt", model: "model.txt", vocab_size: 30000)
Load a model
model = YouTokenToMe::BPE.new("model.txt")
Get vocab
model.vocab
Encode
model.encode(sentences)
Decode
model.decode(ids)
Convert between ids and subwords
model.subword_to_id(subword)
model.id_to_subword(id)
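The encode and decode calls can be combined into a quick round-trip check. The sketch below is illustrative only: `round_trip` is a hypothetical helper, and it assumes that `encode` returns one array of ids per sentence and that `decode` returns one string per id sequence. It accepts any object that responds to those two methods, so it does not load the gem itself.

```ruby
# Hypothetical helper (not part of the youtokentome gem).
# Encodes each sentence, decodes it again, and reports whether the text survived.
def round_trip(model, sentences)
  ids = model.encode(sentences)  # assumed: one array of ids per sentence
  decoded = model.decode(ids)    # assumed: one string per id sequence
  sentences.zip(decoded).map do |original, restored|
    { original: original, restored: restored, lossless: original == restored }
  end
end
```

With a trained model this might be called as `round_trip(YouTokenToMe::BPE.new("model.txt"), ["Blazingly fast tokenization!"])`. Text containing characters the model does not cover may not come back unchanged, in which case `lossless` would be false.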
Train
YouTokenToMe::BPE.train(
data: "train.txt", # path to file with training data
model: "model.txt", # path to where the trained model will be saved
vocab_size: 30000, # number of tokens in the final vocabulary
coverage: 1.0, # fraction of characters covered by the model
n_threads: -1, # number of parallel threads used to run (-1 uses all available threads)
pad_id: 0, # reserved id for padding
unk_id: 1, # reserved id for unknown symbols
bos_id: 2, # reserved id for begin of sentence token
eos_id: 3 # reserved id for end of sentence token
)
Encode
model.encode(
sentences,
output_type: :id, # or :subword
bos: false, # add "beginning of sentence" token
eos: false, # add "end of sentence" token
reverse: false, # reverse output sequence of tokens
dropout_prob: 0.0 # BPE-dropout probability
)
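With a non-zero `dropout_prob`, BPE-dropout randomly skips some merges during encoding, so the same sentence can be split differently from call to call. The sketch below collects the distinct segmentations seen over several calls. `sample_segmentations` is a hypothetical helper, not part of the gem, and the default values for `samples` and `dropout_prob` are arbitrary choices for illustration.

```ruby
# Hypothetical helper (not part of the youtokentome gem).
# Encodes the same sentence several times with BPE-dropout enabled and
# returns the distinct subword segmentations that were observed.
def sample_segmentations(model, sentence, samples: 5, dropout_prob: 0.1)
  Array.new(samples) do
    # encode takes an array of sentences; take the result for our single sentence
    model.encode([sentence], output_type: :subword, dropout_prob: dropout_prob).first
  end.uniq
end
```

With a trained model, `sample_segmentations(model, "Blazingly fast tokenization!", samples: 10)` would be expected to return more than one segmentation for most inputs, while `dropout_prob: 0.0` should always yield a single one.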
View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/youtokentome-ruby.git
cd youtokentome-ruby
bundle install
bundle exec rake compile
bundle exec rake test