A BPE tokenizer for OpenAI's models, ported from and referenced against tiktoken and SharpToken.
The Go standard library does not support PCRE, so this library depends on go-pcre, which requires libpcre3-dev or libpcre++-dev to be installed on the system.
Usage:

```go
package main

import (
	"log"

	"github.com/meinside/geektoken"
)

func main() {
	//text := "Hello, world!"
	// sample text (Korean): "I want my nation to be the most beautiful nation in the world. I do not want it to become the most powerful nation."
	text := "나는 우리나라가 세계에서 가장 아름다운 나라가 되기를 원한다. 가장 부강한 나라가 되기를 원하지 않는다."

	// get a tokenizer for the desired model
	tokenizer, err := geektoken.GetTokenizerWithModel(geektoken.ModelGPT35Turbo)
	if err != nil {
		log.Fatalf("failed to get tokenizer: %s", err)
	}

	// encode the text into BPE tokens
	if encoded, err := tokenizer.Encode(text, nil, nil); err == nil {
		log.Printf("encoded tokens: %+v, token count = %d", encoded, len(encoded))
	} else {
		log.Printf("failed to encode: %s", err)
	}
}
```
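A common use case for a tokenizer like this is counting tokens before sending a prompt to an OpenAI API. The sketch below wraps the same calls shown above (`GetTokenizerWithModel` and `Encode`) in a small helper; note that `countTokens` is only an illustration, not part of this library's API.

```go
package main

import (
	"log"

	"github.com/meinside/geektoken"
)

// countTokens is a hypothetical helper (not provided by geektoken) that
// returns the number of BPE tokens in `text` for the gpt-3.5-turbo encoding.
func countTokens(text string) (int, error) {
	tokenizer, err := geektoken.GetTokenizerWithModel(geektoken.ModelGPT35Turbo)
	if err != nil {
		return 0, err
	}

	encoded, err := tokenizer.Encode(text, nil, nil)
	if err != nil {
		return 0, err
	}

	return len(encoded), nil
}

func main() {
	if count, err := countTokens("How many tokens is this sentence?"); err == nil {
		log.Printf("token count = %d", count)
	} else {
		log.Printf("failed to count tokens: %s", err)
	}
}
```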
- Some encoded bytes differ from those produced by other BPE libraries
- Add more tests
- Optimize the code
License: MIT