Package smile.llm.tokenizer
Class SentencePiece
java.lang.Object
smile.llm.tokenizer.SentencePiece
- All Implemented Interfaces:
Tokenizer
SentencePiece is an unsupervised text tokenizer developed by Google.
It implements subword segmentation with byte-pair encoding (BPE) and a unigram language model.
Constructor Details
SentencePiece
Constructor.
- Parameters:
path - The SentencePiece model file path.
- Throws:
IOException - if the model fails to load.
Method Details
encode
Description copied from interface: Tokenizer
Encodes a string into a list of token IDs.
encode
Description copied from interface: Tokenizer
Encodes a string into a list of token IDs.
decode
Description copied from interface: Tokenizer
Decodes a list of token IDs into a string. Note that a single token may contain only some of the bytes of a multi-byte character. Malformed-input and unmappable-character sequences are always replaced with the charset's default replacement string.
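The partial-byte caveat can be illustrated with the JDK alone. The sketch below does not call SentencePiece. It assumes the bytes are interpreted as UTF-8 and uses a hypothetical helper, decodeLenient, to show what a lenient decoder produces when a token carries only the first byte of a two-byte character. The replacement wording above matches the documented behavior of the String(byte[], Charset) constructor, which is what the helper relies on.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartialByteDecoding {
    /**
     * Hypothetical helper (not part of smile): decodes raw bytes as UTF-8.
     * String(byte[], Charset) replaces malformed input with the charset's
     * default replacement string, which is U+FFFD for UTF-8.
     */
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
        byte[] full = "\u00e9".getBytes(StandardCharsets.UTF_8);
        // Simulate a token that holds only the first of the two bytes.
        byte[] partial = Arrays.copyOf(full, 1);

        // The complete sequence round-trips to the original character.
        System.out.println(decodeLenient(full).equals("\u00e9"));
        // The truncated sequence decodes to the replacement character.
        System.out.println(decodeLenient(partial).equals("\uFFFD"));
    }
}
```

In practice this means decoding token IDs one at a time can yield replacement characters that disappear when the same IDs are decoded together as one sequence.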
tokenize
Description copied from interface: Tokenizer
Segments text into tokens.