Interface Tokenizer
All Known Implementing Classes: SentencePiece, Tiktoken, Tokenizer
public interface Tokenizer
A tokenizer that segments text into tokens and converts between text and token IDs.
-
Method Summary
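Modifier and Type | Method | Description
String | decode(int[] tokens) | Decodes a list of token IDs into a string.
int[] | encode(String text) | Encodes a string into a list of token IDs.
int[] | encode(String text, boolean bos, boolean eos) | Encodes a string into a list of token IDs.
static SentencePiece | sentencePiece(String path) | Loads a SentencePiece model.
static Tiktoken | tiktoken(String path, Pattern pattern) | Loads a tiktoken model with the default BOS and EOS tokens.
static Tiktoken | tiktoken(String path, Pattern pattern, String bos, String eos, String... specialTokens) | Loads a tiktoken model.
String[] | tokenize(String text) | Segments text into tokens.
default String | tryDecode(int[] tokens) | Tries to decode a list of token IDs into a string.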
-
Method Details
-
encode
int[] encode(String text)
Encodes a string into a list of token IDs.
Parameters:
text - The input string to be encoded.
Returns:
A list of token IDs.
-
encode
int[] encode(String text, boolean bos, boolean eos)
Encodes a string into a list of token IDs.
Parameters:
text - The input string to be encoded.
bos - Whether to prepend the beginning-of-sequence token.
eos - Whether to append the end-of-sequence token.
Returns:
A list of token IDs.
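The effect of the bos and eos flags can be illustrated with a toy byte-level encoder. This is only a sketch under stated assumptions: the class BosEosSketch, its constants BOS_ID and EOS_ID, and the one-ID-per-byte vocabulary are hypothetical and are not part of this interface or of any real model.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BosEosSketch {
    // Hypothetical IDs for the special tokens; real models define their own.
    static final int BOS_ID = 256;
    static final int EOS_ID = 257;

    // Toy encoder: every UTF-8 byte becomes one token ID in the range 0..255.
    static int[] encode(String text, boolean bos, boolean eos) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int[] ids = new int[bytes.length + (bos ? 1 : 0) + (eos ? 1 : 0)];
        int i = 0;
        if (bos) ids[i++] = BOS_ID;              // prepend beginning-of-sequence
        for (byte b : bytes) ids[i++] = b & 0xFF; // unsigned byte value as ID
        if (eos) ids[i] = EOS_ID;                // append end-of-sequence
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("hi", false, false)));
        System.out.println(Arrays.toString(encode("hi", true, true)));
    }
}
```

With both flags set, the result is two elements longer than the plain encoding, with the special IDs at the two ends.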
-
decode
String decode(int[] tokens)
Decodes a list of token IDs into a string. Note that a single token may hold only part of the bytes of a multi-byte character. This method always replaces malformed-input and unmappable-character sequences with the charset's default replacement string rather than failing.
Parameters:
tokens - The list of token IDs to be decoded.
Returns:
The decoded string.
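The replacement behavior described above matches what the standard String(byte[], Charset) constructor does. The following sketch uses only the JDK, not this interface, to show what happens when a multi-byte character is cut short; the class and method names (LenientDecodeSketch, decodeLenient) are hypothetical, and UTF-8 is assumed as the charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LenientDecodeSketch {
    // Lenient decoding: malformed or truncated UTF-8 is replaced, never rejected.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The euro sign is three bytes in UTF-8: E2 82 AC.
        byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8);
        // Keep only the first two bytes, as if a token ended mid-character.
        byte[] partial = Arrays.copyOf(euro, 2);

        System.out.println(decodeLenient(euro).equals("\u20AC"));     // complete character
        System.out.println(decodeLenient(partial).contains("\uFFFD")); // replacement character
    }
}
```

A caller that decodes tokens one at a time may therefore see U+FFFD where the full sequence would have produced a valid character.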
-
tryDecode
default String tryDecode(int[] tokens) throws CharacterCodingException
Tries to decode a list of token IDs into a string. This method throws CharacterCodingException if the byte sequence is not legal UTF-8.
Parameters:
tokens - The list of token IDs to be decoded.
Returns:
The decoded string.
Throws:
CharacterCodingException - If the byte sequence is not legal UTF-8.
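One plausible use of a strict decode is streaming output: hold back bytes until they form legal UTF-8, then emit the text. The sketch below reproduces that pattern with the JDK's CharsetDecoder rather than with this interface; StrictDecodeSketch, decodeStrict, and stream are hypothetical names, and the byte chunks stand in for the bytes of successive tokens.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeSketch {
    // Strict decoding: throws instead of substituting a replacement character.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Appends chunks to a pending buffer and emits text only when it decodes cleanly.
    static String stream(byte[][] chunks) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            pending.write(chunk, 0, chunk.length);
            try {
                out.append(decodeStrict(pending.toByteArray()));
                pending.reset(); // everything buffered so far has been emitted
            } catch (CharacterCodingException e) {
                // Not legal UTF-8 yet: keep the bytes and wait for the next chunk.
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The euro sign (E2 82 AC) split across two chunks, then an ASCII '!'.
        byte[][] chunks = { {(byte) 0xE2, (byte) 0x82}, {(byte) 0xAC}, {(byte) 0x21} };
        System.out.println(stream(chunks).equals("\u20AC!"));
    }
}
```

The same loop could be written around tryDecode by buffering token IDs instead of bytes and catching CharacterCodingException in the same place.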
-
tokenize
String[] tokenize(String text)
Segments text into tokens.
-
sentencePiece
static SentencePiece sentencePiece(String path) throws IOException
Loads a SentencePiece model.
Parameters:
path - The SentencePiece model file path.
Returns:
A SentencePiece tokenizer.
Throws:
IOException - if the model fails to load.
-
tiktoken
static Tiktoken tiktoken(String path, Pattern pattern) throws IOException
Loads a tiktoken model with the default BOS and EOS tokens.
Parameters:
path - The tiktoken model file path.
pattern - The regex pattern to split the input text into tokens.
Returns:
A tiktoken tokenizer.
Throws:
IOException - if the model fails to load.
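To make the role of the pattern parameter concrete, the sketch below splits a string into pieces with java.util.regex.Pattern. The pattern shown is a simplified assumption chosen for illustration, not the pattern of any particular model, and PreSplitSketch and split are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreSplitSketch {
    // Simplified, hypothetical pattern: letter runs, digit runs, and punctuation
    // runs (each with an optional leading space), or a run of whitespace.
    static final Pattern PATTERN =
            Pattern.compile(" ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+");

    // Collects each successive match of the pattern as one piece of text.
    static List<String> split(String text) {
        List<String> pieces = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            pieces.add(m.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(split("Hello world 42!"));
    }
}
```

A Pattern built this way is the kind of object that would be passed as the pattern argument of tiktoken, next to the model file path.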
-
tiktoken
static Tiktoken tiktoken(String path, Pattern pattern, String bos, String eos, String... specialTokens) throws IOException
Loads a tiktoken model.
Parameters:
path - The tiktoken model file path.
pattern - The regex pattern to split the input text into tokens.
bos - The beginning-of-sequence token.
eos - The end-of-sequence token.
specialTokens - Optional special tokens.
Returns:
A tiktoken tokenizer.
Throws:
IOException - if the model fails to load.
-