Class Tiktoken

java.lang.Object
smile.llm.tokenizer.Tiktoken
All Implemented Interfaces:
Tokenizer
Direct Known Subclasses:
Tokenizer

public class Tiktoken extends Object implements Tokenizer
tiktoken is a fast BPE tokenizer by OpenAI.
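As a rough illustration of what a BPE (byte pair encoding) tokenizer does, the sketch below repeatedly merges the adjacent pair with the lowest rank until no known pair remains. It is a simplified toy, not this class's implementation: it works on characters and string keys rather than on UTF-8 bytes, and the class name `BpeMergeSketch` and the rank values are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class BpeMergeSketch {
    // Greedy BPE merge: start from single characters and keep merging the
    // adjacent pair whose concatenation has the lowest rank.
    static List<String> merge(String piece, Map<String, Integer> ranks) {
        List<String> parts = new ArrayList<>();
        for (char c : piece.toCharArray()) {
            parts.add(String.valueOf(c));
        }
        while (parts.size() > 1) {
            int best = -1;
            int bestRank = Integer.MAX_VALUE;
            for (int i = 0; i < parts.size() - 1; i++) {
                Integer rank = ranks.get(parts.get(i) + parts.get(i + 1));
                if (rank != null && rank < bestRank) {
                    best = i;
                    bestRank = rank;
                }
            }
            if (best < 0) {
                break; // no adjacent pair is in the rank map
            }
            parts.set(best, parts.get(best) + parts.get(best + 1));
            parts.remove(best + 1); // remove(int index)
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical ranks; a lower rank means the merge is applied earlier.
        Map<String, Integer> ranks = Map.of("lo", 0, "low", 1, "er", 2);
        System.out.println(merge("lower", ranks)); // prints [low, er]
    }
}
```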
  • Constructor Details

    • Tiktoken

      public Tiktoken(Pattern pattern, Map<Bytes,Integer> ranks, String bos, String eos, String... specialTokens)
      Constructs a tokenizer from a split pattern, a token rank map, and the special tokens.
      Parameters:
      pattern - The regex pattern to split the input text into tokens.
      ranks - The token-to-rank map.
      bos - The beginning of sequence token.
      eos - The end of sequence token.
      specialTokens - Optional special tokens.
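To make the role of the pattern parameter concrete, the sketch below splits text into pieces with a plain java.util.regex.Pattern before any BPE merging would take place. The pattern shown is a simplified, hypothetical one chosen for readability; it is not the pattern of any particular model, and the class name `PatternSplitSketch` is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternSplitSketch {
    // Simplified, hypothetical split pattern: a word or a number with an
    // optional leading space, a run of other symbols, or a run of whitespace.
    static final Pattern PATTERN =
            Pattern.compile(" ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+");

    // Collects every successive match of the pattern, in order.
    static List<String> split(String text) {
        List<String> pieces = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            pieces.add(m.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        // prints [Hello, ,,  world,  42, !]
        System.out.println(split("Hello, world 42!"));
    }
}
```

Each piece would then be encoded independently, so a merge can never cross a piece boundary.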
  • Method Details

    • size

      public int size()
      Returns the vocabulary size.
      Returns:
      the vocabulary size.
    • allowSpecialTokens

      public void allowSpecialTokens(boolean allowSpecialTokens)
      Sets how special tokens will be encoded.
      Parameters:
      allowSpecialTokens - If false, special tokens will be encoded as natural text. Otherwise, they will be encoded as special tokens.
    • isSpecialTokenAllowed

      public boolean isSpecialTokenAllowed()
      Returns how special tokens will be encoded.
      Returns:
      false if special tokens will be encoded as natural text; true if they will be encoded as special tokens.
    • specialToken

      public Integer specialToken(String token)
      Returns the special token id.
      Parameters:
      token - a special token.
      Returns:
      the special token id.
    • encode

      public int[] encode(String text)
      Description copied from interface: Tokenizer
      Encodes a string into a list of token IDs.
      Specified by:
      encode in interface Tokenizer
      Parameters:
      text - The input string to be encoded.
      Returns:
      A list of token IDs.
    • encode

      public int[] encode(String text, boolean bos, boolean eos)
      Description copied from interface: Tokenizer
      Encodes a string into a list of token IDs.
      Specified by:
      encode in interface Tokenizer
      Parameters:
      text - The input string to be encoded.
      bos - Whether to prepend the beginning-of-sequence token.
      eos - Whether to append the end-of-sequence token.
      Returns:
      A list of token IDs.
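The effect of the bos and eos flags can be pictured as wrapping the encoded IDs with marker IDs. The sketch below is only an illustration of that idea: the helper `wrap`, the class name, and the ID values are hypothetical and are not part of this class.

```java
import java.util.Arrays;

class SequenceMarkersSketch {
    // Optionally prepends bosId and appends eosId to the token IDs.
    static int[] wrap(int[] ids, boolean bos, boolean eos, int bosId, int eosId) {
        int[] out = new int[ids.length + (bos ? 1 : 0) + (eos ? 1 : 0)];
        int offset = 0;
        if (bos) {
            out[offset++] = bosId;
        }
        System.arraycopy(ids, 0, out, offset, ids.length);
        if (eos) {
            out[out.length - 1] = eosId;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] ids = {10, 11}; // made-up token IDs
        System.out.println(Arrays.toString(wrap(ids, true, false, 1, 2))); // prints [1, 10, 11]
        System.out.println(Arrays.toString(wrap(ids, true, true, 1, 2)));  // prints [1, 10, 11, 2]
    }
}
```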
    • decode

      public String decode(int[] tokens)
      Description copied from interface: Tokenizer
      Decodes a list of token IDs into a string. Because a token may contain only part of a character's bytes, the decoded byte sequence can be malformed. This method always replaces malformed-input and unmappable-character sequences with the UTF-8 charset's default replacement string.
      Specified by:
      decode in interface Tokenizer
      Parameters:
      tokens - The list of token IDs to be decoded.
      Returns:
      The decoded string.
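The replacement behavior described for decode matches what the JDK's String(byte[], Charset) constructor does. The sketch below shows that behavior on raw bytes, assuming the token bytes have already been concatenated; the class name `LenientDecodeSketch` is invented for the example and nothing here calls this class.

```java
import java.nio.charset.StandardCharsets;

class LenientDecodeSketch {
    // String(byte[], Charset) never throws on bad input; malformed
    // sequences are replaced with the replacement character U+FFFD.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8); // 3 bytes: E2 82 AC
        byte[] partial = {euro[0], euro[1]};                      // truncated character

        System.out.println(decode(euro).equals("\u20AC"));       // prints true
        System.out.println(decode(partial).contains("\uFFFD"));  // prints true
    }
}
```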
    • tryDecode

      public String tryDecode(int[] tokens) throws CharacterCodingException
      Description copied from interface: Tokenizer
      Tries to decode a list of token IDs into a string. This method throws CharacterCodingException if the byte sequence is not legal UTF-8.
      Specified by:
      tryDecode in interface Tokenizer
      Parameters:
      tokens - The list of token IDs to be decoded.
      Returns:
      The decoded string.
      Throws:
      CharacterCodingException - If the byte sequence is not legal UTF-8.
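Strict decoding of this kind can be reproduced with the JDK's CharsetDecoder configured to report errors instead of replacing them. The sketch below is a stand-alone illustration on raw bytes, not this class's implementation; `StrictDecodeSketch` and `isCompleteUtf8` are names made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeSketch {
    // Decodes UTF-8 strictly: malformed or truncated input raises
    // CharacterCodingException instead of being replaced.
    static String tryDecode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience check: true if the bytes form complete, legal UTF-8.
    static boolean isCompleteUtf8(byte[] bytes) {
        try {
            tryDecode(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        byte[] partial = {(byte) 0xE2, (byte) 0x82};
        System.out.println(isCompleteUtf8(euro));    // prints true
        System.out.println(isCompleteUtf8(partial)); // prints false
    }
}
```

A strict check like this is useful when streaming generated tokens: output can be held back until the accumulated bytes decode cleanly, so a character split across tokens is never shown as a replacement character.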
    • tokenize

      public String[] tokenize(String text)
      Description copied from interface: Tokenizer
      Segments text into tokens.
      Specified by:
      tokenize in interface Tokenizer
      Parameters:
      text - The input string to be tokenized.
      Returns:
      The tokenized sequence.
    • load

      public static Map<Bytes,Integer> load(String path) throws IOException
      Loads a tiktoken model file.
      Parameters:
      path - The tiktoken model file path.
      Returns:
      The token-to-rank map.
      Throws:
      IOException - If the model file fails to load.
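For orientation, OpenAI's tiktoken vocabulary files store one token per line as the base64-encoded token bytes followed by a space and the integer rank. The sketch below parses lines in that layout, assuming the file read by load follows the same format. It uses ByteBuffer as a stand-in key type because the Bytes class is not reproduced here; `RankFileSketch` and `parse` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RankFileSketch {
    // Parses "<base64 token> <rank>" lines into a token-to-rank map.
    // ByteBuffer is used as the key because it compares by content.
    static Map<ByteBuffer, Integer> parse(List<String> lines) {
        Map<ByteBuffer, Integer> ranks = new HashMap<>();
        for (String line : lines) {
            if (line.isBlank()) {
                continue; // skip empty lines
            }
            String[] parts = line.trim().split("\\s+");
            byte[] token = Base64.getDecoder().decode(parts[0]);
            ranks.put(ByteBuffer.wrap(token), Integer.parseInt(parts[1]));
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Two made-up entries: "Hello" with rank 0 and " world" with rank 1.
        List<String> lines = List.of("SGVsbG8= 0", "IHdvcmxk 1");
        Map<ByteBuffer, Integer> ranks = parse(lines);

        ByteBuffer hello = ByteBuffer.wrap("Hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(ranks.get(hello)); // prints 0
        System.out.println(ranks.size());     // prints 2
    }
}
```

In real use the lines would come from the model file, for example via java.nio.file.Files.readAllLines.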