java.lang.Object

smile.llm.tokenizer.SentencePiece

All Implemented Interfaces: Tokenizer

public class SentencePiece extends Object implements Tokenizer

SentencePiece is an unsupervised text tokenizer developed by Google. It implements byte-pair encoding (BPE) and the unigram language model.

Constructor Summary

- SentencePiece(String path): Creates a tokenizer from a SentencePiece model file.

Method Summary

- String decode(int[] tokens): Decodes a list of token IDs into a string.
- int[] encode(String text): Encodes a string into a list of token IDs.
- int[] encode(String text, boolean bos, boolean eos): Encodes a string into a list of token IDs.
- String[] tokenize(String text): Segments text into tokens.

Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface Tokenizer
tryDecode

Constructor Details
- SentencePiece
  
  public SentencePiece(String path) throws IOException
  
  Creates a tokenizer from a SentencePiece model file.
  
  Parameters:
  
  path - The SentencePiece model file path.
  
  Throws:
  
  IOException - if the model cannot be loaded.

Method Details
- encode
  
  public int[] encode(String text)
  
  Description copied from interface: Tokenizer
  
  Encodes a string into a list of token IDs.
  
  Specified by:
  
  encode in interface Tokenizer
  
  Parameters:
  
  text - The input string to be encoded.
  
  Returns:
  
  A list of token IDs.
- encode
  
  public int[] encode(String text, boolean bos, boolean eos)
  
  Description copied from interface: Tokenizer
  
  Encodes a string into a list of token IDs.
  
  Specified by:
  
  encode in interface Tokenizer
  
  Parameters:
  
  text - The input string to be encoded.
  
  bos - Whether to prepend the beginning-of-sequence token.
  
  eos - Whether to append the end-of-sequence token.
  
  Returns:
  
  A list of token IDs.
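The effect of the bos and eos flags can be pictured with a small stand-alone sketch. It is not Smile's implementation: the class name BosEosSketch, the helper wrap, and the IDs 1 and 2 are assumptions chosen for illustration, since the actual special-token IDs come from the loaded model.

```java
import java.util.Arrays;

// Illustrative sketch only; not part of smile.llm.tokenizer.
class BosEosSketch {
    // Hypothetical special-token IDs. A real model defines its own values.
    static final int BOS_ID = 1;
    static final int EOS_ID = 2;

    /** Returns a copy of ids with BOS prepended and/or EOS appended, as requested. */
    public static int[] wrap(int[] ids, boolean bos, boolean eos) {
        int n = ids.length + (bos ? 1 : 0) + (eos ? 1 : 0);
        int[] out = new int[n];
        int offset = 0;
        if (bos) {
            out[offset++] = BOS_ID;
        }
        // Copy the ordinary token IDs after the optional BOS slot.
        System.arraycopy(ids, 0, out, offset, ids.length);
        if (eos) {
            out[n - 1] = EOS_ID;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] ids = {10, 11};
        System.out.println(Arrays.toString(wrap(ids, true, true)));   // [1, 10, 11, 2]
        System.out.println(Arrays.toString(wrap(ids, false, false))); // [10, 11]
    }
}
```

With both flags false the output has the same content as the input, which is the shape one would expect from the single-argument overload if it adds no special tokens; the page does not state which defaults that overload uses.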
- decode
  
  public String decode(int[] tokens)
  
  Description copied from interface: Tokenizer
  
  Decodes a list of token IDs into a string. Note that a single token may contain only some of the bytes of a multi-byte character. This method always replaces malformed-input and unmappable-character sequences with the charset's default replacement string.
  
  Specified by:
  
  decode in interface Tokenizer
  
  Parameters:
  
  tokens - The list of token IDs to be decoded.
  
  Returns:
  
  The decoded string.
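The note about partial bytes can be reproduced with the Java standard library alone. The sketch below assumes the text is UTF-8 and uses the hypothetical class PartialBytesDemo; it does not call the Smile API. It splits the three UTF-8 bytes of the euro sign as if they belonged to two different tokens, then decodes with String(byte[], Charset), which substitutes the replacement character U+FFFD for malformed input.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not part of smile.llm.tokenizer.
class PartialBytesDemo {
    /** Decodes bytes as UTF-8; malformed sequences become the replacement character. */
    public static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Concatenates two byte arrays, as if joining the bytes of two adjacent tokens. */
    public static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8); // three bytes: E2 82 AC
        byte[] first = {euro[0], euro[1]}; // an incomplete character
        byte[] second = {euro[2]};         // the remaining byte

        // Decoding the incomplete prefix on its own produces a replacement character.
        System.out.println(decodeLenient(first).contains("\uFFFD"));
        // Decoding the joined bytes recovers the original character.
        System.out.println(decodeLenient(concat(first, second)).equals("\u20AC"));
    }
}
```

In practice this means that decoding tokens one at a time can show replacement characters that disappear once the following tokens are included in the same call.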
- tokenize
  
  public String[] tokenize(String text)
  
  Description copied from interface: Tokenizer
  
  Segments text into tokens.
  
  Specified by:
  
  tokenize in interface Tokenizer
  
  Parameters:
  
  text - The input string to be tokenized.
  
  Returns:
  
  The tokenized sequence.
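How tokenize, encode, and decode relate to one another can be sketched with a toy vocabulary. Everything here is an assumption made for illustration: the class ToyVocabTokenizer is hypothetical, it splits on whitespace rather than applying BPE or a unigram model, and it only mirrors the method names listed on this page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; a whitespace tokenizer standing in for a subword model.
class ToyVocabTokenizer {
    private final Map<String, Integer> pieceToId = new LinkedHashMap<>();
    private final List<String> idToPiece = new ArrayList<>();

    /** Builds a vocabulary in which each distinct piece gets the next free ID. */
    public ToyVocabTokenizer(String... pieces) {
        for (String piece : pieces) {
            if (!pieceToId.containsKey(piece)) {
                pieceToId.put(piece, idToPiece.size());
                idToPiece.add(piece);
            }
        }
    }

    /** Segments text into string pieces (here, whitespace-separated words). */
    public String[] tokenize(String text) {
        return text.trim().split("\\s+");
    }

    /** Maps each piece produced by tokenize to its vocabulary ID. */
    public int[] encode(String text) {
        String[] pieces = tokenize(text);
        int[] ids = new int[pieces.length];
        for (int i = 0; i < pieces.length; i++) {
            Integer id = pieceToId.get(pieces[i]);
            if (id == null) {
                throw new IllegalArgumentException("Unknown piece: " + pieces[i]);
            }
            ids[i] = id;
        }
        return ids;
    }

    /** Maps IDs back to pieces and joins them with single spaces. */
    public String decode(int[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(idToPiece.get(tokens[i]));
        }
        return sb.toString();
    }
}
```

The point of the sketch is the division of labor: tokenize returns the pieces as strings, encode returns the IDs of those pieces, and decode inverts the ID lookup. A real subword model differs in how it segments text and how it treats unknown input.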