Class PennTreebankTokenizer

java.lang.Object
smile.nlp.tokenizer.PennTreebankTokenizer
All Implemented Interfaces:
Function<String,String[]>, Tokenizer

public class PennTreebankTokenizer extends Object implements Tokenizer
A word tokenizer that splits English sentences according to the Penn Treebank conventions. Most punctuation is split from adjoining words. Verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately.
Examples:
  • children's -> children 's
  • parents' -> parents '
  • won't -> wo n't
  • can't -> ca n't
  • weren't -> were n't
  • cannot -> can not
  • 'tisn't -> 't is n't
  • 'tis -> 't is
  • gonna -> gon na
  • I'm -> I 'm
  • he'll -> he 'll
This tokenizer assumes that the text has already been segmented into sentences. Any period other than one at the end of the string or before a newline is assumed to be part of the word it is attached to (for example, in an abbreviation) and is not tokenized separately.
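The conventions above can be illustrated with a small, self-contained sketch. It is not the library's implementation: the class name TreebankSplitSketch and its regular expressions are hypothetical, and it covers only a subset of the rules (n't, the clitics 's 'm 'd 'll 're 've, a trailing apostrophe, cannot, gonna, and a period at the end of a line). Other punctuation and forms such as 'tis are deliberately left out.

```java
import java.util.Arrays;

// Hypothetical illustration only; NOT the smile implementation.
public class TreebankSplitSketch {

    /** Splits text using a simplified subset of the Penn Treebank conventions. */
    public static String[] split(String text) {
        String s = text.trim();
        if (s.isEmpty()) {
            return new String[0];
        }

        // A period is split off only at the end of a line or of the string;
        // periods elsewhere (e.g. in "p.m.") stay attached to their word.
        s = s.replaceAll("(?m)\\.$", " .");

        // cannot -> can not, gonna -> gon na (original casing is preserved)
        s = s.replaceAll("(?i)\\b(can)(not)\\b", "$1 $2");
        s = s.replaceAll("(?i)\\b(gon)(na)\\b", "$1 $2");

        // won't -> wo n't, can't -> ca n't, weren't -> were n't
        s = s.replaceAll("(?i)(\\w)(n't)\\b", "$1 $2");

        // children's -> children 's, I'm -> I 'm, he'll -> he 'll
        s = s.replaceAll("(?i)(\\w)('(?:s|m|d|ll|re|ve))\\b", "$1 $2");

        // parents' -> parents '
        s = s.replaceAll("(\\w)'(?=\\s|$)", "$1 '");

        return s.split("\\s+");
    }

    public static void main(String[] args) {
        String sentence = "I can't find the children's toys at 3 p.m. today.";
        System.out.println(Arrays.toString(split(sentence)));
    }
}
```

With the library on the classpath, the corresponding call would be PennTreebankTokenizer.getInstance().split(sentence), using the two methods documented below; its output may differ from this sketch wherever the real tokenizer applies rules that the sketch omits.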
  • Method Details

    • getInstance

      public static PennTreebankTokenizer getInstance()
      Returns the singleton instance.
      Returns:
      the singleton instance.
    • split

      public String[] split(String text)
      Description copied from interface: Tokenizer
      Splits the string into an array of tokens.
      Specified by:
      split in interface Tokenizer
      Parameters:
      text - the text.
      Returns:
      the tokens.