Interface Tokenizer

All Superinterfaces:
Function<String,String[]>
All Known Implementing Classes:
BreakIteratorTokenizer, PennTreebankTokenizer, SimpleTokenizer

public interface Tokenizer extends Function<String,String[]>
A token is a string of characters that is categorized, according to a set of rules, as a single symbol (for example, a word, a number, or a punctuation mark). The process of forming tokens from an input stream of characters is called tokenization.

Tokenization is not as easy as it sounds. For example, when should a token containing a hyphen be split into two or more tokens? When does a period mark the end of an abbreviation rather than the end of a sentence, part of a number, or part of a Roman numeral? Sometimes a period terminates a sentence and an abbreviation at the same time. When should a single quote be split from a word?
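The ambiguities above show up even in a short sentence. The following is a minimal sketch, not part of this library: `NaiveSplitDemo` and `naiveSplit` are hypothetical names, and the method only splits on whitespace, so every hyphen, period, and quote stays attached to its neighboring word.

```java
import java.util.Arrays;

class NaiveSplitDemo {
    // Naive baseline (illustration only): split on runs of whitespace.
    // It makes no decision about hyphens, periods, or quotes.
    static String[] naiveSplit(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Dr. Smith's state-of-the-art lab opened at 5 p.m.";
        // prints [Dr., Smith's, state-of-the-art, lab, opened, at, 5, p.m.]
        System.out.println(Arrays.toString(naiveSplit(text)));
    }
}
```

In this output, "Dr." keeps a period that marks an abbreviation, "Smith's" keeps its possessive quote, "state-of-the-art" stays one token, and the final period of "p.m." ends both the abbreviation and the sentence. A real tokenizer has to choose how to handle each of these cases.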

  • Method Summary

    Modifier and Type    Method                Description
    default String[]     apply(String text)
    String[]             split(String text)    Splits the string into a list of tokens.

    Methods inherited from interface java.util.function.Function

    andThen, compose
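
Because the interface extends Function<String,String[]>, an implementation can be used wherever a Function is expected. The sketch below shows one way this might look. It declares a local stand-in `Tokenizer` with the shape documented here rather than importing the library, and it assumes that `apply` simply delegates to `split`, which this page does not state. `WhitespaceTokenizer` and `TokenizerDemo` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.function.Function;

// Local stand-in mirroring the documented shape; not the library's source.
interface Tokenizer extends Function<String, String[]> {
    String[] split(String text);

    // Assumption: apply delegates to split.
    @Override
    default String[] apply(String text) {
        return split(text);
    }
}

// Hypothetical implementation: splits on runs of whitespace only.
final class WhitespaceTokenizer implements Tokenizer {
    @Override
    public String[] split(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}

class TokenizerDemo {
    public static void main(String[] args) {
        Tokenizer tokenizer = new WhitespaceTokenizer();

        // apply is available because Tokenizer is a Function.
        System.out.println(Arrays.toString(tokenizer.apply("to be or not to be")));

        // andThen (inherited from Function) post-processes the token array.
        Function<String, Integer> countTokens =
                tokenizer.andThen(tokens -> tokens.length);
        System.out.println(countTokens.apply("to be or not to be"));

        // compose (inherited from Function) pre-processes the input text.
        Function<String, String[]> lowerThenSplit =
                tokenizer.compose((String s) -> s.toLowerCase());
        System.out.println(Arrays.toString(lowerThenSplit.apply("To Be")));
    }
}
```

Only `split` has to be implemented; `apply`, `andThen`, and `compose` then come for free, which makes a tokenizer easy to place in a pipeline of text-processing functions.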
  • Method Details