Package smile.llm.tokenizer


package smile.llm.tokenizer
LLM Tokenization. Tokens are the fundamental unit, the "atom" of LLMs. Tokenization is the process of translating text into sequences of tokens and vice versa. A token is not necessarily a word. It could be a smaller unit, like a character or a part of a word, or a larger unit like a whole phrase. The size of the tokens varies from one tokenization approach to another.
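
To make the text-to-tokens round trip concrete, below is a minimal, hypothetical sketch of a toy word-level tokenizer with encode and decode methods. The class and method names are illustrative assumptions, not the API of the classes in this package.

```java
import java.util.*;

// Toy word-level tokenizer: illustrates the text -> token ids -> text round trip.
// Hypothetical sketch for illustration only; not the smile.llm.tokenizer API.
public class ToyTokenizer {
    private final Map<String, Integer> vocab = new HashMap<>();
    private final List<String> inverse = new ArrayList<>();

    // Assign an id to every whitespace-separated word the first time it is seen.
    public int[] encode(String text) {
        String[] words = text.split("\\s+");
        int[] ids = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            ids[i] = vocab.computeIfAbsent(words[i], w -> {
                inverse.add(w);
                return inverse.size() - 1;
            });
        }
        return ids;
    }

    // Map token ids back to text.
    public String decode(int[] ids) {
        StringJoiner joiner = new StringJoiner(" ");
        for (int id : ids) joiner.add(inverse.get(id));
        return joiner.toString();
    }

    public static void main(String[] args) {
        ToyTokenizer tokenizer = new ToyTokenizer();
        int[] ids = tokenizer.encode("the cat sat on the mat");
        System.out.println(Arrays.toString(ids));   // [0, 1, 2, 3, 0, 4]
        System.out.println(tokenizer.decode(ids));  // the cat sat on the mat
    }
}
```

Real LLM tokenizers use subword units rather than whole words, so a single word may map to several token ids, but the encode/decode contract is the same.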

Byte pair encoding (BPE, also known as digram coding) is a widely used LLM tokenization method with the ability to combine both tokens that encode single characters (including single digits or single punctuation marks) and tokens that encode whole words (even the longest compound words). In the first step, the algorithm takes all unique characters as the initial set of 1-character n-grams (i.e. the initial "tokens"). Then, successively, the most frequent pair of adjacent tokens is merged into a new, 2-character-long n-gram, and all instances of the pair are replaced by this new token. This is repeated until a vocabulary of prescribed size is obtained. Note that new words can always be constructed from final vocabulary tokens and initial-set characters.
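
The following is a minimal sketch of the BPE training loop on a toy corpus. The corpus, word frequencies, and number of merges are assumptions made for illustration; this is not the implementation behind the tokenizers in this package.

```java
import java.util.*;

// Toy BPE training sketch: repeatedly merge the most frequent adjacent
// symbol pair into a new token. Illustration only.
public class BpeSketch {
    public static void main(String[] args) {
        // Corpus as word -> frequency, each word split into single-character symbols.
        Map<List<String>, Integer> words = new LinkedHashMap<>();
        words.put(List.of("l", "o", "w"), 5);
        words.put(List.of("l", "o", "w", "e", "r"), 2);
        words.put(List.of("n", "e", "w", "e", "s", "t"), 6);
        words.put(List.of("w", "i", "d", "e", "s", "t"), 3);

        for (int merge = 0; merge < 5; merge++) {
            // Count all adjacent symbol pairs, weighted by word frequency.
            Map<String, Integer> pairs = new HashMap<>();
            for (var entry : words.entrySet()) {
                List<String> symbols = entry.getKey();
                for (int i = 0; i + 1 < symbols.size(); i++) {
                    pairs.merge(symbols.get(i) + " " + symbols.get(i + 1), entry.getValue(), Integer::sum);
                }
            }
            if (pairs.isEmpty()) break;

            // Pick the most frequent pair and fuse it into a single new token.
            String best = Collections.max(pairs.entrySet(), Map.Entry.comparingByValue()).getKey();
            String[] parts = best.split(" ");
            System.out.println("merge " + merge + ": " + parts[0] + " + " + parts[1]);

            // Rewrite every word with the merged token.
            Map<List<String>, Integer> merged = new LinkedHashMap<>();
            for (var entry : words.entrySet()) {
                List<String> symbols = entry.getKey();
                List<String> updated = new ArrayList<>();
                for (int i = 0; i < symbols.size(); i++) {
                    if (i + 1 < symbols.size() && symbols.get(i).equals(parts[0]) && symbols.get(i + 1).equals(parts[1])) {
                        updated.add(parts[0] + parts[1]);
                        i++;  // skip the second symbol of the merged pair
                    } else {
                        updated.add(symbols.get(i));
                    }
                }
                merged.merge(updated, entry.getValue(), Integer::sum);
            }
            words = merged;
        }
    }
}
```

Each printed merge becomes a new vocabulary entry; encoding new text simply replays the learned merges in order, so any string can still be tokenized by falling back to the initial single-character tokens.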

  • SentencePiece: an unsupervised text tokenizer by Google.
  • Tiktoken: a fast BPE tokenizer by OpenAI.
  • Tokenizer: tokenizing and encoding/decoding text.