Skip to content

Pure Go implementation of OpenAI's tiktoken tokenizer

License

Notifications You must be signed in to change notification settings

j178/tiktoken-go

 
 

Repository files navigation

Tests

tiktoken-go

Note

This is a fork of tiktoken-go/tokenizer with some API changes.

This is a pure go port of OpenAI's tokenizer.

Buy Me A Coffee

Usage

package main

import (
	"fmt"

	"github.com/j178/tiktoken-go"
)

func main() {
	enc, err := tiktoken.Get(tiktoken.Cl100kBase)
	if err != nil {
		panic("oh oh")
	}

	// this should print a list of token ids
	ids, _, _ := enc.Encode("supercalifragilistic")
	fmt.Println(ids)

	// this should print the original string back
	text, _ := enc.Decode(ids)
	fmt.Println(text)
}

Alternatively you can use the included command-line tool

> tokenizer -h

Usage of tokenizer:
  -decode string
        tokens to decode
  -encode string
        text to encode
  -token string
        text to calculate token

> tokenizer -encode supercalifragilistic

Todo

  • ✅ port code
  • ✅ cl100k_base encoding
  • ✅ r50k_base encoding
  • ✅ p50k_base encoding
  • ✅ p50k_edit encoding
  • ✅ tests
  • ❌ handle special tokens
  • ❌ gpt-2 model

Caveats

This library embeds OpenAI's vocabularies—which are not small (~4Mb)— as go maps. This is different than what the way python version of tiktoken works, which downloads the dictionaries and puts them in a cache folder.

However, since the dictionaries are compiled during the go build process the performance and start-up times should be better than downloading and loading them at runtime.

Alternatives

Here is a list of other libraries that do something similar.

About

Pure Go implementation of OpenAI's tiktoken tokenizer

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

 

Packages

No packages published

Languages

  • Go 100.0%