Package smile.llm.tokenizer
Interface Tokenizer
- All Known Implementing Classes:
SentencePiece, Tiktoken, Tokenizer
public interface Tokenizer
Tokenizing and encoding/decoding text.
-
Method Summary
Modifier and Type | Method | Description
String | decode(int[] tokens) | Decodes a list of token IDs into a string.
int[] | encode(String text) | Encodes a string into a list of token IDs.
int[] | encode(String text, boolean bos, boolean eos) | Encodes a string into a list of token IDs.
static SentencePiece | sentencePiece(String path) | Loads a SentencePiece model.
static Tiktoken | tiktoken(String path, Pattern pattern) | Loads a tiktoken model with the default BOS and EOS tokens.
static Tiktoken | tiktoken(String path, Pattern pattern, String bos, String eos, String... specialTokens) | Loads a tiktoken model.
String[] | tokenize(String text) | Segments text into tokens.
default String | tryDecode(int[] tokens) | Tries to decode a list of token IDs into a string.
-
Method Details
-
encode
Encodes a string into a list of token IDs.
- Parameters:
text - The input string to be encoded.
- Returns:
A list of token IDs.
-
encode
Encodes a string into a list of token IDs.
- Parameters:
text - The input string to be encoded.
bos - Whether to prepend the beginning-of-sequence token.
eos - Whether to append the end-of-sequence token.
- Returns:
A list of token IDs.
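The effect of the bos and eos flags can be pictured with a small standalone sketch. It does not call the library: the class name, the wrap helper, and the token IDs 1 and 2 are all hypothetical and chosen only for illustration. The real IDs depend on the loaded model.

```java
import java.util.Arrays;

final class SpecialTokenSketch {
    /**
     * Returns a copy of ids with an optional leading BOS id and an
     * optional trailing EOS id. bosId and eosId are hypothetical
     * placeholders, not values taken from any real model.
     */
    static int[] wrap(int[] ids, boolean bos, boolean eos, int bosId, int eosId) {
        int n = ids.length + (bos ? 1 : 0) + (eos ? 1 : 0);
        int[] out = new int[n];
        int offset = 0;
        if (bos) {
            out[offset++] = bosId;   // prepend beginning-of-sequence
        }
        System.arraycopy(ids, 0, out, offset, ids.length);
        if (eos) {
            out[n - 1] = eosId;      // append end-of-sequence
        }
        return out;
    }

    public static void main(String[] args) {
        int[] ids = {15, 27, 42};
        // BOS only, as is common when encoding a prompt for generation.
        System.out.println(Arrays.toString(wrap(ids, true, false, 1, 2)));
    }
}
```

A prompt passed to a generative model is often encoded with a BOS token but no EOS token, so that the model continues the sequence rather than treating it as finished.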
-
decode
Decodes a list of token IDs into a string. Note that a token may contain only some of the bytes of a character. This method always replaces malformed-input and unmappable-character sequences with the charset's default replacement string.
- Parameters:
tokens - The list of token IDs to be decoded.
- Returns:
The decoded string.
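The note about partial bytes can be reproduced with the JDK alone. The sketch below assumes the underlying bytes are UTF-8 and uses a plain byte array as a stand-in for the bytes of a token. The class and method names are hypothetical and this is not the library's implementation.

```java
import java.nio.charset.StandardCharsets;

final class PartialBytesSketch {
    /** Lenient decoding: malformed input is replaced, never reported. */
    static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The euro sign (U+20AC) occupies three bytes in UTF-8.
        byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8);

        // Pretend a token boundary fell after the second byte.
        byte[] firstTwo = {euro[0], euro[1]};

        String partial = lenientDecode(firstTwo);
        String complete = lenientDecode(euro);

        // The truncated sequence yields the replacement character U+FFFD.
        System.out.println("partial contains U+FFFD: " + partial.contains("\uFFFD"));
        System.out.println("complete equals euro sign: " + complete.equals("\u20AC"));
    }
}
```

Lenient decoding is convenient for display, but the replacement character silently hides the fact that more bytes were still to come.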
-
tryDecode
Tries to decode a list of token IDs into a string. This method throws CharacterCodingException if the byte sequence is not valid UTF-8.
- Parameters:
tokens - The list of token IDs to be decoded.
- Returns:
The decoded string.
- Throws:
CharacterCodingException - If the byte sequence is not valid UTF-8.
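A strict decode of this kind is useful when streaming output: bytes can be held back until they form complete characters. The following is a minimal JDK-only sketch of that pattern, with byte chunks standing in for the bytes of successive tokens. The class and method names are hypothetical, and the library may implement tryDecode differently.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class StrictDecodeSketch {
    /** Strict decoding: malformed or truncated UTF-8 raises an exception. */
    static String strictDecode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Emits text only once the pending bytes decode cleanly. */
    static List<String> stream(byte[][] chunks) {
        List<String> emitted = new ArrayList<>();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            pending.write(chunk, 0, chunk.length);
            try {
                emitted.add(strictDecode(pending.toByteArray()));
                pending.reset();   // everything was decoded; start afresh
            } catch (CharacterCodingException e) {
                // Incomplete character: keep the bytes and wait for more.
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        byte[][] chunks = {
            "Hi ".getBytes(StandardCharsets.UTF_8),
            {(byte) 0xE2, (byte) 0x82},   // first two bytes of the euro sign
            {(byte) 0xAC}                 // final byte of the euro sign
        };
        for (String piece : stream(chunks)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

This sketch holds bytes back indefinitely if they are genuinely invalid rather than merely incomplete, so a real streaming loop would likely fall back to lenient decoding at the end of the sequence.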
-
tokenize
Segments text into tokens.
- Parameters:
text - The input string to be tokenized.
- Returns:
The tokenized sequence.
-
sentencePiece
Loads a SentencePiece model.
- Parameters:
path - The SentencePiece model file path.
- Returns:
A SentencePiece tokenizer.
- Throws:
IOException - If the model fails to load.
-
tiktoken
Loads a tiktoken model with the default BOS and EOS tokens.
- Parameters:
path - The tiktoken model file path.
- Returns:
A tiktoken tokenizer.
- Throws:
IOException - If the model fails to load.
-
tiktoken
static Tiktoken tiktoken(String path, Pattern pattern, String bos, String eos, String... specialTokens) throws IOException
Loads a tiktoken model.
- Parameters:
path - The tiktoken model file path.
bos - The beginning-of-sequence token.
eos - The end-of-sequence token.
specialTokens - Optional special tokens.
- Returns:
A tiktoken tokenizer.
- Throws:
IOException - If the model fails to load.
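The pattern parameter is not described above. In tiktoken-style tokenizers, a regular expression typically splits the text into coarse pieces before byte-pair merges are applied within each piece. Assuming the parameter plays that role here, the sketch below shows how such a pattern behaves using only java.util.regex. The regular expression is the one published with GPT-2, chosen for illustration. The class and method names are hypothetical, and the pattern a given model file requires may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PreTokenizeSketch {
    // GPT-2 style splitting pattern: contractions, letter runs, digit runs,
    // punctuation runs, and whitespace. Used here only as an example.
    static final Pattern GPT2_PATTERN = Pattern.compile(
            "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+");

    /** Returns the successive matches of the pattern in the text. */
    static List<String> split(Pattern pattern, String text) {
        List<String> pieces = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            pieces.add(matcher.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        for (String piece : split(GPT2_PATTERN, "Hello world, it's 2024!")) {
            // Brackets make the leading spaces visible.
            System.out.println("[" + piece + "]");
        }
    }
}
```

Note that a leading space stays attached to the word that follows it, which is why decoding the pieces in order reproduces the original text exactly.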
-