java.lang.Object

smile.llm.tokenizer.Tiktoken

All Implemented Interfaces: Tokenizer

Direct Known Subclasses: Tokenizer

public class Tiktoken extends Object implements Tokenizer

tiktoken is a fast byte pair encoding (BPE) tokenizer by OpenAI.
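
As a rough illustration of what rank-driven BPE means, the sketch below repeatedly merges the adjacent pair with the lowest rank until no known pair remains. It is a toy under stated assumptions, not the code of this class: `BpeSketch` and `merge` are hypothetical names, and it works on characters and `String` keys for readability, whereas this class keys its ranks by `Bytes`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy BPE merge loop (hypothetical helper, not part of Smile). */
final class BpeSketch {
    /**
     * Starts from single characters and repeatedly merges the adjacent pair
     * whose concatenation has the lowest rank, until no pair is in the table.
     */
    static List<String> merge(String word, Map<String, Integer> ranks) {
        List<String> parts = new ArrayList<>();
        for (char c : word.toCharArray()) {
            parts.add(String.valueOf(c));
        }
        while (parts.size() > 1) {
            int best = -1;
            int bestRank = Integer.MAX_VALUE;
            for (int i = 0; i < parts.size() - 1; i++) {
                Integer rank = ranks.get(parts.get(i) + parts.get(i + 1));
                if (rank != null && rank < bestRank) {
                    best = i;          // leftmost pair wins on ties
                    bestRank = rank;
                }
            }
            if (best < 0) {
                break;                 // no adjacent pair is in the rank table
            }
            parts.set(best, parts.get(best) + parts.get(best + 1));
            parts.remove(best + 1);
        }
        return parts;
    }

    public static void main(String[] args) {
        Map<String, Integer> ranks = Map.of("lo", 0, "low", 1);
        System.out.println(merge("lower", ranks)); // prints [low, e, r]
    }
}
```

A lower rank means the pair is merged earlier, which is why the vocabulary is stored as a token-to-rank map.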

Field Summary

Fields

| Modifier and Type | Field | Description |
|---|---|---|
| protected final Map<Bytes,Integer> | ranks | Token -> Rank |
| protected final Map<String,Integer> | specialTokens | Special Token -> Rank |

Constructor Summary

Constructors

| Constructor | Description |
|---|---|
| Tiktoken(Pattern pattern, Map<Bytes,Integer> ranks, String bos, String eos, String... specialTokens) | Constructor. |

Method Summary

| Modifier and Type | Method | Description |
|---|---|---|
| void | allowSpecialTokens(boolean allowSpecialTokens) | Sets how special tokens will be encoded. |
| String | decode(int[] tokens) | Decodes a list of token IDs into a string. |
| int[] | encode(String text) | Encodes a string into a list of token IDs. |
| int[] | encode(String text, boolean bos, boolean eos) | Encodes a string into a list of token IDs. |
| boolean | isSpecialTokenAllowed() | Returns how special tokens will be encoded. |
| static Map<Bytes,Integer> | load(String path) | Loads a tiktoken model file. |
| int | size() | Returns the vocabulary size. |
| Integer | specialToken(String token) | Returns the special token id. |
| String[] | tokenize(String text) | Segments text into tokens. |
| String | tryDecode(int[] tokens) | Tries to decode a list of token IDs into a string. |

Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- ranks
  
  protected final Map<Bytes,Integer> ranks
  
  Token -> Rank
- specialTokens
  
  protected final Map<String,Integer> specialTokens
  
  Special Token -> Rank
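
The token -> rank map is keyed by byte sequences rather than strings, because a token need not be a complete character. The sketch below shows one way such a map could be built, assuming the line-oriented format used by OpenAI's tiktoken model files (a base64-encoded token, whitespace, then the rank). Whether `load` accepts exactly this format is an assumption here. `RankFileSketch` is a hypothetical name, and `ByteBuffer` stands in for `Bytes` only because it also compares by content.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for tiktoken-style rank lines (not part of Smile). */
final class RankFileSketch {
    /** Parses lines of the assumed form "<base64 token bytes> <rank>". */
    static Map<ByteBuffer, Integer> parse(List<String> lines) {
        Map<ByteBuffer, Integer> ranks = new HashMap<>();
        for (String line : lines) {
            if (line.isBlank()) {
                continue;  // tolerate empty lines
            }
            String[] fields = line.trim().split("\\s+", 2);
            byte[] token = Base64.getDecoder().decode(fields[0]);
            // A raw byte[] uses identity equality, so it cannot be looked up by
            // content. ByteBuffer defines equals/hashCode over its bytes.
            ranks.put(ByteBuffer.wrap(token), Integer.parseInt(fields[1]));
        }
        return ranks;
    }
}
```

For example, the lines `YQ== 0` and `aGk= 1` would map the bytes of "a" to rank 0 and the bytes of "hi" to rank 1.
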

Constructor Details
- Tiktoken
  
  public Tiktoken(Pattern pattern, Map<Bytes,Integer> ranks, String bos, String eos, String... specialTokens)
  
  Constructor.
  
  Parameters:
  
  pattern - The regex pattern to split the input text into tokens.
  
  ranks - The token to rank map.
  
  bos - The beginning of sequence token.
  
  eos - The end of sequence token.
  
  specialTokens - Optional special tokens.
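
To make the role of the `pattern` parameter concrete, the sketch below splits text into pieces with a regular expression before any BPE merging would take place. The pattern is a deliberately simple, hypothetical one chosen for illustration. It is not the pattern of any particular model, and `PreSplitSketch` is not part of this library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-splitting step (not part of Smile). */
final class PreSplitSketch {
    // Illustrative pattern: a run of letters, a run of digits,
    // a run of whitespace, or any other single character.
    static final Pattern PATTERN =
            Pattern.compile("\\p{L}+|\\p{N}+|\\s+|[^\\s\\p{L}\\p{N}]");

    /** Returns the successive matches of PATTERN in text. */
    static List<String> split(String text) {
        List<String> pieces = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            pieces.add(m.group());
        }
        return pieces;
    }
}
```

With this pattern, `split("Hello, world 42")` yields the pieces `Hello`, `,`, a space, `world`, a space, and `42`. Splitting first keeps merges from crossing piece boundaries, so a letter is never merged with the punctuation that follows it.
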

Method Details
- size
  
  public int size()
  
  Returns the vocabulary size.
  
  Returns:
  
  the vocabulary size.
- allowSpecialTokens
  
  public void allowSpecialTokens(boolean allowSpecialTokens)
  
  Sets how special tokens will be encoded.
  
  Parameters:
  
  allowSpecialTokens - If false, special tokens will be encoded as natural text. Otherwise, they will be encoded as special tokens.
- isSpecialTokenAllowed
  
  public boolean isSpecialTokenAllowed()
  
  Returns how special tokens will be encoded.
  
  Returns:
  
  false if special tokens will be encoded as natural text; true if they will be encoded as special tokens.
- specialToken
  
  public Integer specialToken(String token)
  
  Returns the special token id.
  
  Parameters:
  
  token - a special token.
  
  Returns:
  
  the special token id.
- encode
  
  public int[] encode(String text)
  
  Description copied from interface: Tokenizer
  
  Encodes a string into a list of token IDs.
  
  Specified by:
  
  encode in interface Tokenizer
  
  Parameters:
  
  text - The input string to be encoded.
  
  Returns:
  
  A list of token IDs.
- encode
  
  public int[] encode(String text, boolean bos, boolean eos)
  
  Description copied from interface: Tokenizer
  
  Encodes a string into a list of token IDs.
  
  Specified by:
  
  encode in interface Tokenizer
  
  Parameters:
  
  text - The input string to be encoded.
  
  bos - Whether to prepend the beginning-of-sequence token.
  
  eos - Whether to append the end-of-sequence token.
  
  Returns:
  
  A list of token IDs.
- decode
  
  public String decode(int[] tokens)
  
  Description copied from interface: Tokenizer
  
  Decodes a list of token IDs into a string. Note that a token may contain only some of the bytes of a character. This method always replaces malformed-input and unmappable-character sequences with the charset's default replacement string.
  
  Specified by:
  
  decode in interface Tokenizer
  
  Parameters:
  
  tokens - The list of token IDs to be decoded.
  
  Returns:
  
  The decoded string.
- tryDecode
  
  public String tryDecode(int[] tokens) throws CharacterCodingException
  
  Description copied from interface: Tokenizer
  
  Tries to decode a list of token IDs into a string. This method throws CharacterCodingException if the byte sequence is not legal UTF-8.
  
  Specified by:
  
  tryDecode in interface Tokenizer
  
  Parameters:
  
  tokens - The list of token IDs to be decoded.
  
  Returns:
  
  The decoded string.
  
  Throws:
  
  CharacterCodingException - If the byte sequence is not legal UTF-8.
- tokenize
  
  public String[] tokenize(String text)
  
  Description copied from interface: Tokenizer
  
  Segments text into tokens.
  
  Specified by:
  
  tokenize in interface Tokenizer
  
  Parameters:
  
  text - The input string to be tokenized.
  
  Returns:
  
  The tokenized sequence.
- load
  
  public static Map<Bytes,Integer> load(String path) throws IOException
  
  Loads a tiktoken model file.
  
  Parameters:
  
  path - The tiktoken model file path.
  
  Returns:
  
  the token -> rank map.
  
  Throws:
  
  IOException - if the model fails to load.
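
The difference between `decode` and `tryDecode` described above can be reproduced with the JDK alone. The sketch below is not the code of this class. `DecodeSketch` and its methods are hypothetical names that mirror the two documented behaviours on raw bytes: replacing malformed input versus reporting it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Lenient versus strict UTF-8 decoding (hypothetical helper, not part of Smile). */
final class DecodeSketch {
    /** Lenient: malformed input is replaced with U+FFFD, as String(byte[], Charset) does. */
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Strict: malformed input raises CharacterCodingException. */
    static String strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }
}
```

The character "é" is the two bytes 0xC3 0xA9 in UTF-8. If a token carries only 0xC3, `lenient` returns the replacement character U+FFFD, while `strict` throws. The strict form suits streaming output, where a caller may prefer to wait for the next token rather than show a replacement character.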
