YouTokenToMe Ruby

YouTokenToMe - high-performance unsupervised text tokenization - for Ruby

Installation

Add this line to your application’s Gemfile:

gem "youtokentome"

Getting Started

Dump your text to a file

Blazingly fast tokenization!
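As a minimal sketch of this step, the following writes a small corpus to train.txt with Ruby's standard File.write. The sample sentences are placeholders, and the file name is an assumption chosen to match the data: path used in the training call; substitute your own text.

```ruby
# Placeholder corpus for illustration only; use your own text in practice.
sentences = [
  "Blazingly fast tokenization!",
  "High performance unsupervised text tokenization for Ruby"
]

# Write one sentence per line. "train.txt" is an assumed file name chosen
# to match the data: path passed to the trainer.
File.write("train.txt", sentences.join("\n") + "\n")
```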

Train a model

model = YouTokenToMe::BPE.train(data: "train.txt", model: "model.txt", vocab_size: 30000)

Load a model

model = YouTokenToMe::BPE.new("model.txt")

Get vocab

model.vocab

Encode

model.encode(sentences)

Decode

model.decode(ids)

Convert between ids and subwords

model.subword_to_id(subword)
model.id_to_subword(id)
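
The two conversion calls can be combined to inspect a whole sequence at once. The helpers below are hypothetical (they are not part of the gem) and rely only on the subword_to_id and id_to_subword calls shown above; model is assumed to be a loaded YouTokenToMe::BPE instance.

```ruby
# Hypothetical helper: map each id in a sequence to its subword
# using the documented id_to_subword call.
def ids_to_subwords(model, ids)
  ids.map { |id| model.id_to_subword(id) }
end

# Hypothetical helper: map each subword in a sequence to its id
# using the documented subword_to_id call.
def subwords_to_ids(model, subwords)
  subwords.map { |subword| model.subword_to_id(subword) }
end
```

Both helpers take a single flat sequence, so if encode returns one array per sentence (an assumption here), pass them one inner array at a time.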

Options

Train

YouTokenToMe::BPE.train(
  data: "train.txt",   # path to file with training data
  model: "model.txt",  # path to where the trained model will be saved
  vocab_size: 30000,   # number of tokens in the final vocabulary
  coverage: 1.0,       # fraction of characters covered by the model
  n_threads: -1,       # number of parallel threads to use (-1 uses all available)
  pad_id: 0,           # reserved id for padding
  unk_id: 1,           # reserved id for unknown symbols
  bos_id: 2,           # reserved id for begin of sentence token
  eos_id: 3            # reserved id for end of sentence token
)
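
To make the coverage option more concrete, the sketch below computes which characters would be kept if only the most frequent ones were retained until the requested fraction of the text is covered. This is a plain-Ruby illustration of the idea, not the library's implementation; the covered_characters helper and the choice to ignore whitespace are assumptions made for the example.

```ruby
# Hypothetical helper: return the smallest set of most-frequent characters
# whose combined share of the (non-whitespace) text reaches `coverage`.
def covered_characters(text, coverage)
  counts = text.each_char.reject { |c| c.match?(/\s/) }.tally
  total = counts.values.sum
  kept = []
  covered = 0

  # Visit characters from most to least frequent (ties broken by character).
  counts.sort_by { |char, n| [-n, char] }.each do |char, n|
    break if covered >= coverage * total
    kept << char
    covered += n
  end

  kept
end
```

With a coverage of 1.0 every character in the text is kept; lower values drop the rarest characters first, which in a real model would presumably fall back to the unknown symbol.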

Encode

model.encode(
  sentences,
  output_type: :id,    # or :subword
  bos: false,          # add "beginning of sentence" token
  eos: false,          # add "end of sentence" token
  reverse: false,      # reverse output sequence of tokens
  dropout_prob: 0.0    # BPE-dropout probability
)
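
When checking how a model splits text, it can help to see ids and subwords side by side. The sketch below is a hypothetical helper (not part of the gem) that calls encode once per output_type and pairs the results. It assumes encode takes an array of sentences and returns one array per sentence. It leaves dropout_prob at its default of 0.0, since with BPE-dropout enabled two separate calls may segment the same sentence differently.

```ruby
# Hypothetical helper: encode sentences as ids and as subwords, then pair
# them up so each sentence becomes an array of [id, subword] entries.
def encode_pairs(model, sentences, bos: false, eos: false, reverse: false)
  options = { bos: bos, eos: eos, reverse: reverse }

  ids = model.encode(sentences, output_type: :id, **options)
  subwords = model.encode(sentences, output_type: :subword, **options)

  # Zip per sentence first, then per token within each sentence.
  ids.zip(subwords).map { |id_row, subword_row| id_row.zip(subword_row) }
end
```

With a loaded model, encode_pairs(model, sentences, bos: true, eos: true) would also show where the boundary tokens land in each sequence.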

History

View the changelog in CHANGELOG.md

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

Report bugs
Fix bugs and submit pull requests
Write, clarify, or fix documentation
Suggest or add new features

To get started with development:

git clone https://github.com/ankane/youtokentome-ruby.git
cd youtokentome-ruby
bundle install
bundle exec rake compile
bundle exec rake test
