🙂 Fast state-of-the-art tokenizers for Ruby
Add this line to your application’s Gemfile:
gem "tokenizers"
Load a pretrained tokenizer
tokenizer = Tokenizers.from_pretrained("bert-base-cased")
Encode
encoded = tokenizer.encode("I can feel the magic, can you?")
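The encoding returned by encode is usually inspected through its tokens and ids, and the ids can be passed back to decode. The sketch below wraps those three calls in a small helper. The name encoding_summary is hypothetical and not part of the gem; it only assumes an object that responds to encode and decode as the tokenizer above does.

```ruby
# Hypothetical helper (not part of the tokenizers gem): collect the parts of
# an encoding that are usually inspected first.
def encoding_summary(tokenizer, text)
  encoded = tokenizer.encode(text)
  {
    tokens: encoded.tokens,                  # token strings
    ids: encoded.ids,                        # integer vocabulary ids
    decoded: tokenizer.decode(encoded.ids)   # ids mapped back to text
  }
end

# Usage with the gem installed (needs network access to fetch the tokenizer):
#   require "tokenizers"
#   tokenizer = Tokenizers.from_pretrained("bert-base-cased")
#   pp encoding_summary(tokenizer, "I can feel the magic, can you?")
```

Because the helper relies only on encode and decode, it can be exercised with any stand-in object while the gem is not available.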
Create a tokenizer
tokenizer = Tokenizers::Tokenizer.new(Tokenizers::Models::BPE.new(unk_token: "[UNK]"))
Set the pre-tokenizer
tokenizer.pre_tokenizer = Tokenizers::PreTokenizers::Whitespace.new
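To give a feel for what a whitespace pre-tokenizer does before the model sees the text, here is a rough pure-Ruby approximation. It assumes the pattern \w+|[^\w\s]+ described in the upstream Tokenizers documentation. The method name whitespace_pre_tokenize is hypothetical, and Ruby's \w and \s are ASCII-only here, so treat this as a sketch rather than the gem's implementation.

```ruby
# Rough approximation of whitespace pre-tokenization: runs of word characters
# and runs of punctuation/symbols become separate pieces; whitespace is dropped.
# Hypothetical helper, not the gem's implementation.
def whitespace_pre_tokenize(text)
  text.scan(/\w+|[^\w\s]+/)
end

pieces = whitespace_pre_tokenize("Hello, y'all! How are you 😁 ?")
p pieces
# => ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "😁", "?"]
```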
Train the tokenizer (example data)
trainer = Tokenizers::Trainers::BpeTrainer.new(special_tokens: ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer)
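For intuition about what a BPE trainer learns, the following toy sketch repeatedly merges the most frequent adjacent pair of symbols in a small word-frequency table. It is a teaching illustration only: the method names and the word counts are made up, and the gem's trainer does considerably more (special tokens, vocabulary limits, and so on).

```ruby
# Toy byte-pair-encoding merge loop. All names here are hypothetical and the
# corpus is made up; this is not how the gem is implemented.

# Count adjacent symbol pairs, weighted by word frequency, and return the
# most frequent pair (or nil if no word has two or more symbols).
def most_frequent_pair(words)
  counts = Hash.new(0)
  words.each do |symbols, freq|
    symbols.each_cons(2) { |pair| counts[pair] += freq }
  end
  counts.max_by { |_pair, count| count }&.first
end

# Replace every occurrence of the pair with a single merged symbol.
def merge_pair(words, pair)
  merged = {}
  words.each do |symbols, freq|
    out = []
    i = 0
    while i < symbols.size
      if i < symbols.size - 1 && symbols[i] == pair[0] && symbols[i + 1] == pair[1]
        out << pair.join
        i += 2
      else
        out << symbols[i]
        i += 1
      end
    end
    merged[out] = merged.fetch(out, 0) + freq
  end
  merged
end

# Start from single characters and learn up to num_merges merge rules.
def toy_bpe_merges(word_freqs, num_merges)
  words = word_freqs.map { |word, freq| [word.chars, freq] }.to_h
  merges = []
  num_merges.times do
    pair = most_frequent_pair(words)
    break if pair.nil?
    merges << pair
    words = merge_pair(words, pair)
  end
  merges
end

merges = toy_bpe_merges({ "hug" => 10, "pug" => 5, "pun" => 12, "bun" => 4, "hugs" => 5 }, 3)
p merges
# => [["u", "g"], ["u", "n"], ["h", "ug"]]
```

Each learned merge becomes a new vocabulary entry, which is why frequent character sequences end up as single tokens after training.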
Encode
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
Save the tokenizer to a file
tokenizer.save("tokenizer.json")
Load a tokenizer from a file
tokenizer = Tokenizers.from_file("tokenizer.json")
Check out the Quicktour and the equivalent Ruby code for more info.
This library follows the Tokenizers Python API. You can follow Python tutorials and convert the code to Ruby in many cases. Feel free to open an issue if you run into problems.
View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/tokenizers-ruby.git
cd tokenizers-ruby
bundle install
bundle exec rake compile
bundle exec rake download:files
bundle exec rake test