Class SentencePiece

java.lang.Object
smile.llm.tokenizer.SentencePiece
All Implemented Interfaces:
Tokenizer

public class SentencePiece extends Object implements Tokenizer
SentencePiece is an unsupervised text tokenizer developed by Google. It implements byte-pair encoding (BPE) and the unigram language model.
  • Constructor Details

    • SentencePiece

      public SentencePiece(String path) throws IOException
      Loads a tokenizer from a SentencePiece model file.
      Parameters:
      path - The SentencePiece model file path.
      Throws:
      IOException - if the model fails to load.
  • Method Details

    • encode

      public int[] encode(String text)
      Description copied from interface: Tokenizer
      Encodes a string into a list of token IDs.
      Specified by:
      encode in interface Tokenizer
      Parameters:
      text - The input string to be encoded.
      Returns:
      A list of token IDs.
    • encode

      public int[] encode(String text, boolean bos, boolean eos)
      Description copied from interface: Tokenizer
      Encodes a string into a list of token IDs.
      Specified by:
      encode in interface Tokenizer
      Parameters:
      text - The input string to be encoded.
      bos - Whether to prepend the beginning-of-sequence token.
      eos - Whether to append the end-of-sequence token.
      Returns:
      A list of token IDs.
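
      The effect of the bos and eos flags can be illustrated with a small stand-in. The sketch below is not the SentencePiece implementation: the class BosEosSketch, the whitespace splitting, the toy vocabulary, and the IDs 0, 1, and 2 for unknown, beginning-of-sequence, and end-of-sequence are all assumptions made for illustration. It only shows where the special tokens would land in the returned array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in; the real model defines its own vocabulary and special IDs.
class BosEosSketch {
    static final int UNK = 0; // assumed ID for unknown pieces
    static final int BOS = 1; // assumed beginning-of-sequence ID
    static final int EOS = 2; // assumed end-of-sequence ID
    static final Map<String, Integer> VOCAB = Map.of("hello", 3, "world", 4);

    /** Splits on whitespace and wraps the IDs with BOS/EOS when requested. */
    static int[] encode(String text, boolean bos, boolean eos) {
        List<Integer> ids = new ArrayList<>();
        if (bos) ids.add(BOS);               // prepend
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) ids.add(VOCAB.getOrDefault(piece, UNK));
        }
        if (eos) ids.add(EOS);               // append
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(encode("hello world", true, true)));
        System.out.println(java.util.Arrays.toString(encode("hello world", false, false)));
    }
}
```

      With both flags set, the array gains one element at each end; with both cleared, only the IDs of the text itself are returned.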
    • decode

      public String decode(int[] tokens)
      Description copied from interface: Tokenizer
      Decodes a list of token IDs into a string. Note that a token may contain only some of the bytes of a character, so a token sequence can end in the middle of a character. Malformed-input and unmappable-character sequences are always replaced with the charset's default replacement string.
      Specified by:
      decode in interface Tokenizer
      Parameters:
      tokens - The list of token IDs to be decoded.
      Returns:
      The decoded string.
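
      The note about partial bytes can be reproduced with the JDK alone. The sketch below assumes the token bytes are decoded as UTF-8 through the standard String(byte[], Charset) constructor, which this page does not confirm; the class PartialBytesDemo and its helper are hypothetical names. It cuts a three-byte character short and shows the replacement character appearing in the result.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; it does not call the tokenizer itself.
class PartialBytesDemo {
    /** Decodes raw bytes as UTF-8; malformed input is replaced, never thrown. */
    static String decodeBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The euro sign U+20AC takes three bytes in UTF-8: E2 82 AC.
        byte[] full = "\u20AC".getBytes(StandardCharsets.UTF_8);
        // Keep only the first two bytes, as if the last one sat in a later token.
        byte[] partial = Arrays.copyOf(full, 2);

        System.out.println(decodeBytes(full).equals("\u20AC"));
        System.out.println(decodeBytes(partial).contains("\uFFFD"));
    }
}
```

      A caller that streams tokens one at a time may therefore prefer to decode the accumulated ID sequence rather than each token on its own.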
    • tokenize

      public String[] tokenize(String text)
      Description copied from interface: Tokenizer
      Segments text into tokens.
      Specified by:
      tokenize in interface Tokenizer
      Parameters:
      text - The input string to be tokenized.
      Returns:
      The tokenized sequence.
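
      Upstream SentencePiece marks whitespace with the meta symbol U+2581 inside its pieces. Assuming the strings returned by tokenize follow that convention (this page does not say so), the pieces can be joined back into text as sketched below. The class PieceJoinSketch and the sample pieces are hypothetical.

```java
// Hypothetical helper; assumes pieces use U+2581 as the whitespace marker.
class PieceJoinSketch {
    static final char META = '\u2581';

    /** Concatenates pieces and turns the whitespace marker back into spaces. */
    static String detokenize(String[] pieces) {
        String joined = String.join("", pieces).replace(META, ' ');
        // Drop the single leading space produced by the marker on the first piece.
        return joined.startsWith(" ") ? joined.substring(1) : joined;
    }

    public static void main(String[] args) {
        String[] pieces = {"\u2581Hello", "\u2581Wor", "ld", "."};
        System.out.println(detokenize(pieces));
    }
}
```

      Because the marker records where spaces were, joining needs no language-specific rules about which pieces to separate.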