Class Tiktoken
java.lang.Object
smile.llm.tokenizer.Tiktoken
- All Implemented Interfaces:
Tokenizer
- Direct Known Subclasses:
Tokenizer
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionvoidallowSpecialTokens(boolean allowSpecialTokens) Sets how special tokens will be encoded.decode(int[] tokens) Decodes a list of token IDs into a string.int[]Encodes a string into a list of token IDs.int[]Encodes a string into a list of token IDs.booleanReturns how special tokens will be encoded.Loads a tiktoken model file.intsize()Returns the vocabulary size.specialToken(String token) Returns the special token id.String[]Segments text into tokens.tryDecode(int[] tokens) Try to decode a list of token IDs into a string.
-
Field Details
-
ranks
-
specialTokens
-
-
Constructor Details
-
Tiktoken
public Tiktoken(Pattern pattern, Map<Bytes, Integer> ranks, String bos, String eos, String... specialTokens) Constructor.- Parameters:
pattern- The regex pattern to split the input text into tokens.ranks- The token to rank map.bos- The beginning of sequence token.eos- The end of sequence token.specialTokens- Optional special tokens.
-
-
Method Details
-
size
public int size()Returns the vocabulary size.- Returns:
- the vocabulary size.
-
allowSpecialTokens
public void allowSpecialTokens(boolean allowSpecialTokens) Sets how special tokens will be encoded.- Parameters:
allowSpecialTokens- If false, special tokens will be encoded as natural text. Otherwise, they will be encoded as special tokens.
-
isSpecialTokenAllowed
public boolean isSpecialTokenAllowed()Returns how special tokens will be encoded.- Returns:
- false if special tokens will be encoded as natural text; true if they will be encoded as special tokens.
-
specialToken
-
encode
-
encode
Description copied from interface:TokenizerEncodes a string into a list of token IDs. -
decode
Description copied from interface:TokenizerDecodes a list of token IDs into a string. Note that a token may contain only partial bytes of a character. This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. -
tryDecode
Description copied from interface:TokenizerTry to decode a list of token IDs into a string. This method throws CharacterCodingException if the byte sequence is not legal UTF-8.- Specified by:
tryDecodein interfaceTokenizer- Parameters:
tokens- The list of token IDs to be decoded.- Returns:
- The decoded string.
- Throws:
CharacterCodingException- If the byte sequence is not legal UTF-8.
-
tokenize
-
load
Loads a tiktoken model file.- Parameters:
path- The tiktoken model file path.- Returns:
- the token -> rank map.
- Throws:
IOException- if fail to load the model.
-