Package smile.llm.tokenizer
Class Tiktoken
java.lang.Object
smile.llm.tokenizer.Tiktoken
- All Implemented Interfaces:
Tokenizer
- Direct Known Subclasses:
Tokenizer
tiktoken is a fast BPE tokenizer by OpenAI.
-
Method Summary
void allowSpecialTokens(boolean allowSpecialTokens)
- Sets how special tokens will be encoded.
decode(int[] tokens)
- Decodes a list of token IDs into a string.
int[] encode
- Encodes a string into a list of token IDs.
int[] encode
- Encodes a string into a list of token IDs.
boolean isSpecialTokenAllowed()
- Returns how special tokens will be encoded.
load
- Loads a tiktoken model file.
int size()
- Returns the vocabulary size.
specialToken(String token)
- Returns the special token id.
String[] tokenize
- Segments text into tokens.
tryDecode(int[] tokens)
- Tries to decode a list of token IDs into a string.
-
Field Details
-
ranks
Token -> Rank
-
specialTokens
Special Token -> Rank
-
-
Constructor Details
-
Tiktoken
public Tiktoken(Pattern pattern, Map<Bytes, Integer> ranks, String bos, String eos, String... specialTokens)
Constructor.
- Parameters:
pattern
- The regex pattern to split the input text into tokens.
ranks
- The token to rank map.
bos
- The beginning of sequence token.
eos
- The end of sequence token.
specialTokens
- Optional special tokens.
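To make the role of the ranks map more concrete, the following is a toy sketch of rank-driven byte pair merging on a single piece of text, such as one produced by the splitting pattern. It is not SMILE's implementation: the class name BpeMergeSketch, the method encodePiece, and the use of String keys instead of Bytes are assumptions made for illustration, and the input is treated as plain characters rather than raw bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of smile.llm.tokenizer.
final class BpeMergeSketch {
    /** Merges adjacent parts, lowest rank first, then maps each part to its rank. */
    static int[] encodePiece(String piece, Map<String, Integer> ranks) {
        // Start from single characters (a real tokenizer starts from bytes).
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < piece.length(); i++) {
            parts.add(String.valueOf(piece.charAt(i)));
        }

        while (parts.size() > 1) {
            int best = -1;
            int bestRank = Integer.MAX_VALUE;
            // Find the adjacent pair whose concatenation has the lowest rank.
            for (int i = 0; i + 1 < parts.size(); i++) {
                Integer rank = ranks.get(parts.get(i) + parts.get(i + 1));
                if (rank != null && rank < bestRank) {
                    best = i;
                    bestRank = rank;
                }
            }
            if (best < 0) break; // no mergeable pair is left
            parts.set(best, parts.get(best) + parts.get(best + 1));
            parts.remove(best + 1);
        }

        // In this sketch the rank of a token doubles as its token ID.
        int[] ids = new int[parts.size()];
        for (int i = 0; i < ids.length; i++) {
            Integer id = ranks.get(parts.get(i));
            if (id == null) {
                throw new IllegalArgumentException("Unknown token: " + parts.get(i));
            }
            ids[i] = id;
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Integer> ranks = Map.of("a", 0, "b", 1, "c", 2, "ab", 3, "abc", 4);
        System.out.println(Arrays.toString(encodePiece("abcab", ranks))); // prints [4, 3]
    }
}
```

With this toy vocabulary, "abcab" first merges both "ab" pairs (rank 3), then merges "ab" + "c" into "abc" (rank 4), leaving the two tokens "abc" and "ab".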
-
-
Method Details
-
size
public int size()
Returns the vocabulary size.
- Returns:
- the vocabulary size.
-
allowSpecialTokens
public void allowSpecialTokens(boolean allowSpecialTokens)
Sets how special tokens will be encoded.
- Parameters:
allowSpecialTokens
- If false, special tokens will be encoded as natural text. Otherwise, they will be encoded as special tokens.
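One way to picture the effect of this flag is as a pre-segmentation step: when special tokens are allowed, their literal text is cut out of the input so that each occurrence can map to a reserved ID, and otherwise the whole input is treated as ordinary text. The sketch below is an assumption about the general technique, not a description of SMILE's internals; the class SpecialTokenSketch, the method segment, and the token strings "<|bos|>" and "<|eos|>" are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only; not part of smile.llm.tokenizer.
final class SpecialTokenSketch {
    /** Splits text so that each special token becomes its own segment when allowed. */
    static List<String> segment(String text, List<String> specialTokens, boolean allowSpecialTokens) {
        List<String> segments = new ArrayList<>();
        if (!allowSpecialTokens || specialTokens.isEmpty()) {
            // Special tokens are not recognized: everything is natural text.
            if (!text.isEmpty()) segments.add(text);
            return segments;
        }

        // Match any special token literally; quote to escape regex metacharacters.
        Pattern special = Pattern.compile(
                specialTokens.stream().map(Pattern::quote).collect(Collectors.joining("|")));
        Matcher matcher = special.matcher(text);
        int start = 0;
        while (matcher.find()) {
            if (matcher.start() > start) {
                segments.add(text.substring(start, matcher.start())); // natural text
            }
            segments.add(matcher.group()); // the special token itself
            start = matcher.end();
        }
        if (start < text.length()) segments.add(text.substring(start));
        return segments;
    }

    public static void main(String[] args) {
        List<String> special = List.of("<|bos|>", "<|eos|>");
        System.out.println(segment("<|bos|>Hi<|eos|>", special, true));  // three segments
        System.out.println(segment("<|bos|>Hi<|eos|>", special, false)); // one segment
    }
}
```

Treating special tokens as natural text is the safer choice for untrusted input, since a user cannot then inject control tokens by typing their literal text.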
-
isSpecialTokenAllowed
public boolean isSpecialTokenAllowed()
Returns how special tokens will be encoded.
- Returns:
- false if special tokens will be encoded as natural text; true if they will be encoded as special tokens.
-
specialToken
Returns the special token id.
- Parameters:
token
- a special token.
- Returns:
- the special token id.
-
encode
Description copied from interface: Tokenizer
Encodes a string into a list of token IDs.
-
encode
Description copied from interface: Tokenizer
Encodes a string into a list of token IDs.
-
decode
Description copied from interface: Tokenizer
Decodes a list of token IDs into a string. Note that a token may contain only some of the bytes of a character. This method always replaces malformed-input and unmappable-character sequences with the UTF-8 charset's default replacement string.
-
tryDecode
Description copied from interface: Tokenizer
Tries to decode a list of token IDs into a string. This method throws CharacterCodingException if the byte sequence is not legal UTF-8.
- Specified by:
tryDecode
in interface Tokenizer
- Parameters:
tokens
- The list of token IDs to be decoded.
- Returns:
- The decoded string.
- Throws:
CharacterCodingException
- If the byte sequence is not legal UTF-8.
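The difference between decode and tryDecode mirrors the two error-handling modes of the standard Java charset API. The sketch below uses only java.nio.charset to show both modes on a truncated UTF-8 sequence, which is what a token holding only part of a character would yield. The class Utf8DecodeSketch and its methods lenient and strict are hypothetical names, and this is not claimed to be how SMILE implements either method.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of smile.llm.tokenizer.
final class Utf8DecodeSketch {
    /** Lenient mode: malformed input is replaced with U+FFFD, as decode is documented to do. */
    static String lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Strict mode: malformed input raises CharacterCodingException, as tryDecode is documented to do. */
    static String strict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] whole = "\u00e9".getBytes(StandardCharsets.UTF_8); // two bytes: 0xC3 0xA9
        byte[] partial = { whole[0] };                            // only the first byte

        System.out.println(strict(whole).equals("\u00e9"));    // prints true
        System.out.println(lenient(partial).equals("\uFFFD")); // prints true
        try {
            strict(partial);
        } catch (CharacterCodingException e) {
            System.out.println("incomplete character");
        }
    }
}
```

The strict mode is useful when streaming tokens one at a time: a failure signals that more tokens are needed before the buffered bytes form a complete character.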
-
tokenize
Description copied from interface: Tokenizer
Segments text into tokens.
-
load
Loads a tiktoken model file.
- Parameters:
path
- The tiktoken model file path.
- Returns:
- the token -> rank map.
- Throws:
IOException
- if the model fails to load.
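For orientation, tiktoken model files as published by OpenAI are plain text with one entry per line: the Base64-encoded token bytes, a space, and the integer rank. The sketch below parses that layout from an in-memory string. It assumes this file layout, uses ByteBuffer keys in place of SMILE's Bytes type, and the names TiktokenFileSketch and parse are hypothetical; it is not the body of load.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of smile.llm.tokenizer.
final class TiktokenFileSketch {
    /** Parses lines of the assumed form "<base64 token bytes> <rank>" into a token -> rank map. */
    static Map<ByteBuffer, Integer> parse(String content) {
        Map<ByteBuffer, Integer> ranks = new LinkedHashMap<>();
        content.lines().forEach(line -> {
            if (line.isBlank()) return; // skip empty lines
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            byte[] token = Base64.getDecoder().decode(parts[0]);
            // ByteBuffer compares by content, so it can serve as a map key here.
            ranks.put(ByteBuffer.wrap(token), Integer.parseInt(parts[1]));
        });
        return ranks;
    }

    public static void main(String[] args) {
        // "aGVsbG8=" is Base64 for "hello"; "IHdvcmxk" is Base64 for " world".
        Map<ByteBuffer, Integer> ranks = parse("aGVsbG8= 0\nIHdvcmxk 1\n");
        ByteBuffer hello = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(ranks.get(hello)); // prints 0
    }
}
```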
-