Class SimpleTokenizer

All Implemented Interfaces:
Function<String,String[]>, Tokenizer

public class SimpleTokenizer extends Object implements Tokenizer
A word tokenizer for English sentences that differs from TreebankWordTokenizer in some respects, notably in how it handles "not" contractions. If a period serves as both the end of a sentence and part of an abbreviation, e.g. "etc." at the end of a sentence, it generates the tokens "etc." and ".", whereas TreebankWordTokenizer generates "etc" and ".".

Most punctuation is split from adjoining words. Verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately. Examples:

  • children's -> children 's
  • parents' -> parents '
  • won't -> will not
  • can't -> can not
  • shan't -> shall not
  • cannot -> can not
  • weren't -> were not
  • 'tisn't -> it is not
  • 'tis -> it is
  • gonna -> gon na
  • I'm -> I 'm
  • he'll -> he 'll
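The rewrites listed above can be approximated with a few ordered regular-expression substitutions. The following is a minimal sketch, not the library's implementation: the class name `ContractionSketch` and its `tokenize` method are hypothetical, only lowercase forms are handled, and only the clitics that appear in the examples ('s, 'm, 'll) are covered.

```java
// Hypothetical illustration class; not part of the library.
final class ContractionSketch {
    // Irregular forms from the example list, as {regex, replacement} pairs.
    // Order matters: "'tisn't" must be rewritten before "'tis".
    private static final String[][] IRREGULAR = {
        {"\\bwon't\\b", "will not"},
        {"\\bcan't\\b", "can not"},
        {"\\bshan't\\b", "shall not"},
        {"\\bcannot\\b", "can not"},
        {"'tisn't\\b", "it is not"},
        {"'tis\\b", "it is"},
        {"\\bgonna\\b", "gon na"},
    };

    static String[] tokenize(String text) {
        String s = text;
        for (String[] rule : IRREGULAR) {
            s = s.replaceAll(rule[0], rule[1]);
        }
        // Regular not-contractions: weren't -> were not
        s = s.replaceAll("(\\w)n't\\b", "$1 not");
        // Clitics kept as separate tokens: children's -> children 's, I'm -> I 'm
        s = s.replaceAll("(\\w)('s|'m|'ll)\\b", "$1 $2");
        // Genitive of plural nouns: parents' -> parents '
        s = s.replaceAll("(\\w)'(?=\\s|$)", "$1 '");
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] samples = {"children's", "parents'", "won't", "weren't", "'tisn't", "I'm"};
        for (String w : samples) {
            System.out.println(w + " -> " + String.join(" ", tokenize(w)));
        }
    }
}
```

The irregular forms are applied first so that the generic "n't" rule never sees "won't" or "can't", which would otherwise become "wo not" and "ca not".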
This tokenizer assumes that the text has already been segmented into sentences. Any periods, apart from those at the end of a string or before a newline, are assumed to be part of the word they are attached to (e.g. for abbreviations, etc.), and are not separately tokenized.
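The period rule can be illustrated with a single lookahead: a period is detached only when it is the last character of the string or is immediately followed by a line break. This is a rough sketch under that reading of the rule. The class `PeriodSketch` is hypothetical, and it does not reproduce the special case from the class description in which a sentence-final abbreviation such as "etc." yields both "etc." and ".".

```java
// Hypothetical illustration class; not part of the library.
final class PeriodSketch {
    static String[] tokenize(String text) {
        // Detach a period only at the very end of the input (\z)
        // or directly before a line break; all other periods stay attached.
        String s = text.replaceAll("\\.(?=\\r?\\n|\\z)", " . ");
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String sentence = "Dr. Smith lives on Main St. near the park.";
        // Prints: Dr. | Smith | lives | on | Main | St. | near | the | park | .
        System.out.println(String.join(" | ", tokenize(sentence)));
    }
}
```

Because only the final period is split off, "Dr." and "St." keep their periods, which is why the input must already be segmented into sentences.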
  • Constructor Details

    • SimpleTokenizer

      public SimpleTokenizer()
    • SimpleTokenizer

      public SimpleTokenizer(boolean splitContraction)
      Parameters:
      splitContraction - if true, split contractions into their component words.
  • Method Details

    • split

      public String[] split(String text)
      Description copied from interface: Tokenizer
      Splits the string into a list of tokens.
      Specified by:
      split in interface Tokenizer
      Parameters:
      text - the text.
      Returns:
      the tokens.
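Because the class is listed as implementing Function<String,String[]>, a tokenizer can be passed wherever a function from a string to a token array is expected, for example in a stream pipeline. The sketch below shows that pattern with a stand-in interface so that it runs without the library. `SketchTokenizer`, `FunctionSketch`, and the whitespace tokenizer are hypothetical, and the assumption that `apply` delegates to `split` is not confirmed by this page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for the Tokenizer interface; the real one may differ.
interface SketchTokenizer extends Function<String, String[]> {
    String[] split(String text);

    // Assumption: apply simply delegates to split.
    @Override
    default String[] apply(String text) {
        return split(text);
    }
}

final class FunctionSketch {
    // A trivial whitespace tokenizer, used only to keep the example self-contained.
    static final SketchTokenizer WHITESPACE = text -> text.trim().split("\\s+");

    // Tokenizes each sentence and flattens the results into one list.
    static List<String> tokenizeAll(List<String> sentences,
                                    Function<String, String[]> tokenizer) {
        return sentences.stream()
                .map(tokenizer)
                .flatMap(Arrays::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints: [Hello, world, Good, morning]
        System.out.println(tokenizeAll(List.of("Hello world", "Good morning"), WHITESPACE));
    }
}
```

With the library on the classpath, an instance created by one of the constructors documented above could presumably be passed as the `tokenizer` argument in the same way.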