Class Tokenizer

All Implemented Interfaces:
Tokenizer

public class Tokenizer extends Tiktoken
Custom tokenizer for Llama 3 models.
  • Constructor Details

    • Tokenizer

      public Tokenizer(Map<Bytes,Integer> ranks)
      Constructor with default BOS, EOS, and special tokens.
      Parameters:
      ranks - The token to rank map.
    • Tokenizer

      public Tokenizer(Map<Bytes,Integer> ranks, String bos, String eos, String... specialTokens)
      Constructor.
      Parameters:
      ranks - The token to id map.
      bos - beginning of sequence token.
      eos - end of sequence token.
      specialTokens - Optional special tokens.
  • Method Details

    • encodeMessage

      public int[] encodeMessage(Message message)
      Encodes a message.
      Parameters:
      message - the message.
      Returns:
      the tokens.
    • encodeDialog

      public int[] encodeDialog(Message... dialog)
      Encodes the messages of a dialog.
      Parameters:
      dialog - the messages.
      Returns:
      the tokens.
    • of

      public static Tokenizer of(String path) throws IOException
      Loads a llama3 tokenizer model.
      Parameters:
      path - The llama3 model file path.
      Returns:
      a llama3 tokenizer.
      Throws:
      IOException - if fail to load the model.