Package smile.llm.tokenizer


package smile.llm.tokenizer
LLM Tokenization. Tokens are the fundamental unit, the "atom" of LLMs. Tokenization is the process of translating text into sequences of tokens and vice versa. A token is not necessarily a word. It could be a smaller unit, like a character or a part of a word, or a larger unit like a whole phrase. The size of the tokens varies from one tokenization approach to another.
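
To make the text-to-tokens round trip concrete, below is a minimal, hypothetical sketch of a toy word-level tokenizer with encode and decode methods. The class and method names are illustrative assumptions, not the API of the classes in this package.

```java
import java.util.*;

// Toy word-level tokenizer: illustrates the text -> token ids -> text round trip.
// Hypothetical sketch for illustration only; not the smile.llm.tokenizer API.
public class ToyTokenizer {
    private final Map<String, Integer> vocab = new HashMap<>();
    private final List<String> inverse = new ArrayList<>();

    // Assign an id to every whitespace-separated word the first time it is seen.
    public int[] encode(String text) {
        String[] words = text.split("\\s+");
        int[] ids = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            ids[i] = vocab.computeIfAbsent(words[i], w -> {
                inverse.add(w);
                return inverse.size() - 1;
            });
        }
        return ids;
    }

    // Map token ids back to text.
    public String decode(int[] ids) {
        StringJoiner joiner = new StringJoiner(" ");
        for (int id : ids) joiner.add(inverse.get(id));
        return joiner.toString();
    }

    public static void main(String[] args) {
        ToyTokenizer tokenizer = new ToyTokenizer();
        int[] ids = tokenizer.encode("the cat sat on the mat");
        System.out.println(Arrays.toString(ids));   // [0, 1, 2, 3, 0, 4]
        System.out.println(tokenizer.decode(ids));  // the cat sat on the mat
    }
}
```

Real LLM tokenizers use subword units rather than whole words, so a single word may map to several token ids, but the encode/decode contract is the same.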

Byte pair encoding (BPE, also known as digram coding) is a widely used LLM tokenization method with the ability to combine both tokens that encode single characters (including single digits or single punctuation marks) and tokens that encode whole words (even the longest compound words). In the first step, the algorithm takes all unique characters as the initial set of 1-character n-grams (i.e. the initial "tokens"). Then, successively, the most frequent pair of adjacent tokens is merged into a new, 2-character-long n-gram, and all instances of the pair are replaced by this new token. This is repeated until a vocabulary of prescribed size is obtained. Note that new words can always be constructed from final vocabulary tokens and initial-set characters.
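
The following is a minimal sketch of the BPE training loop on a toy corpus. The corpus, word frequencies, and number of merges are assumptions made for illustration; this is not the implementation behind the tokenizers in this package.

```java
import java.util.*;

// Toy BPE training sketch: repeatedly merge the most frequent adjacent
// symbol pair into a new token. Illustration only.
public class BpeSketch {
    public static void main(String[] args) {
        // Corpus as word -> frequency, each word split into single-character symbols.
        Map<List<String>, Integer> words = new LinkedHashMap<>();
        words.put(List.of("l", "o", "w"), 5);
        words.put(List.of("l", "o", "w", "e", "r"), 2);
        words.put(List.of("n", "e", "w", "e", "s", "t"), 6);
        words.put(List.of("w", "i", "d", "e", "s", "t"), 3);

        for (int merge = 0; merge < 5; merge++) {
            // Count all adjacent symbol pairs, weighted by word frequency.
            Map<String, Integer> pairs = new HashMap<>();
            for (var entry : words.entrySet()) {
                List<String> symbols = entry.getKey();
                for (int i = 0; i + 1 < symbols.size(); i++) {
                    pairs.merge(symbols.get(i) + " " + symbols.get(i + 1), entry.getValue(), Integer::sum);
                }
            }
            if (pairs.isEmpty()) break;

            // Pick the most frequent pair and fuse it into a single new token.
            String best = Collections.max(pairs.entrySet(), Map.Entry.comparingByValue()).getKey();
            String[] parts = best.split(" ");
            System.out.println("merge " + merge + ": " + parts[0] + " + " + parts[1]);

            // Rewrite every word with the merged token.
            Map<List<String>, Integer> merged = new LinkedHashMap<>();
            for (var entry : words.entrySet()) {
                List<String> symbols = entry.getKey();
                List<String> updated = new ArrayList<>();
                for (int i = 0; i < symbols.size(); i++) {
                    if (i + 1 < symbols.size() && symbols.get(i).equals(parts[0]) && symbols.get(i + 1).equals(parts[1])) {
                        updated.add(parts[0] + parts[1]);
                        i++;  // skip the second symbol of the merged pair
                    } else {
                        updated.add(symbols.get(i));
                    }
                }
                merged.merge(updated, entry.getValue(), Integer::sum);
            }
            words = merged;
        }
    }
}
```

Each printed merge becomes a new vocabulary entry; encoding new text simply replays the learned merges in order, so any string can still be tokenized by falling back to the initial single-character tokens.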

  • SentencePiece: an unsupervised text tokenizer by Google.
  • Tiktoken: a fast BPE tokenizer by OpenAI.
  • Tokenizer: tokenizing and encoding/decoding text.