Interface Tokenizer

All Known Implementing Classes:
SentencePiece, Tiktoken, Tokenizer

public interface Tokenizer
A tokenizer that segments text into tokens and converts between text and token IDs.
  • Method Details

    • encode

      int[] encode(String text)
      Encodes a string into an array of token IDs.
      Parameters:
      text - The input string to be encoded.
      Returns:
      An array of token IDs.
    • encode

      int[] encode(String text, boolean bos, boolean eos)
      Encodes a string into an array of token IDs, optionally adding the beginning-of-sequence and end-of-sequence tokens.
      Parameters:
      text - The input string to be encoded.
      bos - Whether to prepend the beginning-of-sequence token.
      eos - Whether to append the end-of-sequence token.
      Returns:
      An array of token IDs.
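
      The effect of the bos and eos flags can be sketched with a toy encoder. This is only an illustration, not the library's implementation: the class name, the byte-per-token scheme, and the special-token IDs 256 and 257 are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BosEosSketch {
    // Hypothetical special-token IDs; real models define their own.
    static final int BOS_ID = 256;
    static final int EOS_ID = 257;

    // Toy encoder: one token per UTF-8 byte (IDs 0..255). Not a real tokenizer model.
    static int[] encode(String text, boolean bos, boolean eos) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int[] ids = new int[bytes.length + (bos ? 1 : 0) + (eos ? 1 : 0)];
        int i = 0;
        if (bos) ids[i++] = BOS_ID;               // prepend beginning-of-sequence
        for (byte b : bytes) ids[i++] = b & 0xFF; // map each byte to an unsigned ID
        if (eos) ids[i] = EOS_ID;                 // append end-of-sequence
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("hi", true, true)));
        // prints [256, 104, 105, 257]
        System.out.println(Arrays.toString(encode("hi", false, false)));
        // prints [104, 105]
    }
}
```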
    • decode

      String decode(int[] tokens)
      Decodes an array of token IDs into a string. Note that a single token may contain only some of the bytes of a multi-byte character, so the decoded byte sequence is not guaranteed to be valid. This method always replaces malformed-input and unmappable-character sequences with the charset's default replacement string.
      Parameters:
      tokens - The array of token IDs to be decoded.
      Returns:
      The decoded string.
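
      The replacement behavior described above matches what the JDK does when a byte array is turned into a String with a charset. The following sketch assumes UTF-8 and uses a hypothetical helper, decodeBytes, to stand in for the final bytes-to-string step; it does not call the library itself.

```java
import java.nio.charset.StandardCharsets;

public class LenientDecodeSketch {
    // Lenient decoding: malformed UTF-8 becomes U+FFFD instead of raising an error.
    static String decodeBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
        byte[] full = "\u00E9".getBytes(StandardCharsets.UTF_8);
        // Hypothetical token that carries only the lead byte of the character.
        byte[] firstHalf = { full[0] };

        System.out.println(decodeBytes(firstHalf).equals("\uFFFD")); // prints true
        System.out.println(decodeBytes(full).equals("\u00E9"));      // prints true
    }
}
```

      Decoding the incomplete token on its own yields the replacement character, while decoding the complete byte sequence recovers the original text.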
    • tryDecode

      default String tryDecode(int[] tokens) throws CharacterCodingException
      Tries to decode an array of token IDs into a string. Unlike decode, this method throws CharacterCodingException if the byte sequence is not legal UTF-8.
      Parameters:
      tokens - The array of token IDs to be decoded.
      Returns:
      The decoded string.
      Throws:
      CharacterCodingException - If the byte sequence is not legal UTF-8.
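
      Strict decoding of this kind can be sketched with the JDK's CharsetDecoder configured to report errors. The class and helper names below are hypothetical, and the library's own default implementation may differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeSketch {
    // Strict decoding: report malformed or unmappable input instead of replacing it.
    static String strictDecode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Returns true when the bytes form complete, legal UTF-8.
    static boolean isCompleteUtf8(byte[] bytes) {
        try {
            strictDecode(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] complete = { (byte) 0xC3, (byte) 0xA9 }; // U+00E9 in UTF-8
        byte[] truncated = { (byte) 0xC3 };             // lead byte only

        System.out.println(isCompleteUtf8(complete));  // prints true
        System.out.println(isCompleteUtf8(truncated)); // prints false
    }
}
```

      One plausible use of a strict variant is streaming output: a caller can keep buffering tokens while decoding fails and emit text only once the accumulated bytes decode cleanly.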
    • tokenize

      String[] tokenize(String text)
      Segments text into tokens.
      Parameters:
      text - The input string to be tokenized.
      Returns:
      The tokenized sequence.
    • sentencePiece

      static SentencePiece sentencePiece(String path) throws IOException
      Loads a SentencePiece model.
      Parameters:
      path - The SentencePiece model file path.
      Returns:
      A SentencePiece tokenizer.
      Throws:
      IOException - If the model fails to load.
    • tiktoken

      static Tiktoken tiktoken(String path, Pattern pattern) throws IOException
      Loads a tiktoken model with the default BOS and EOS tokens.
      Parameters:
      path - The tiktoken model file path.
      pattern - The regular expression pattern used to split the input text.
      Returns:
      A tiktoken tokenizer.
      Throws:
      IOException - If the model fails to load.
    • tiktoken

      static Tiktoken tiktoken(String path, Pattern pattern, String bos, String eos, String... specialTokens) throws IOException
      Loads a tiktoken model.
      Parameters:
      path - The tiktoken model file path.
      pattern - The regular expression pattern used to split the input text.
      bos - The beginning-of-sequence token.
      eos - The end-of-sequence token.
      specialTokens - Optional special tokens.
      Returns:
      A tiktoken tokenizer.
      Throws:
      IOException - If the model fails to load.
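
      Both tiktoken overloads take a Pattern. In tiktoken-style tokenizers, a regular expression typically splits the input into pieces before byte-pair merges are applied within each piece. The sketch below shows only that pre-splitting step with plain java.util.regex. The pattern is a simplified illustration, not the one any particular model uses, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreSplitSketch {
    // Illustrative pattern only: runs of letters, runs of digits,
    // runs of other non-space symbols, or runs of whitespace.
    static final Pattern SPLIT =
            Pattern.compile("\\p{L}+|\\p{N}+|[^\\s\\p{L}\\p{N}]+|\\s+");

    // Collects every successive match of the pattern as one piece.
    static List<String> preSplit(String text, Pattern pattern) {
        List<String> pieces = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            pieces.add(m.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Print each piece in brackets so whitespace pieces stay visible.
        for (String piece : preSplit("Hello, world 42", SPLIT)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

      For the input "Hello, world 42" this yields six pieces: "Hello", ",", " ", "world", " ", and "42". A real model's pattern is usually more elaborate, so the pieces it produces will differ.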