smile.nlp.tokenizer.PennTreebankTokenizer

All Implemented Interfaces: Function<String,String[]>, Tokenizer

public class PennTreebankTokenizer extends Object implements Tokenizer

A word tokenizer that splits English sentences into tokens following the Penn Treebank conventions. Most punctuation is split from adjoining words. Verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately. Examples:

children's -> children 's
parents' -> parents '
won't -> wo n't
can't -> ca n't
weren't -> were n't
cannot -> can not
'tisn't -> 't is n't
'tis -> 't is
gonna -> gon na
I'm -> I 'm
he'll -> he 'll
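
The splitting conventions listed above can be approximated with a handful of regular-expression rewrites. The following is a minimal sketch only, not the implementation used by PennTreebankTokenizer: the class name ContractionSplitSketch, its tokenize method, and the specific patterns are assumptions made for illustration, and the sketch covers only the contraction and genitive cases from the examples, not general punctuation.

```java
class ContractionSplitSketch {
    // Hypothetical helper: rewrites contractions in place, then splits on whitespace.
    static String[] tokenize(String text) {
        String s = text;
        // Fixed forms: cannot -> can not, gonna -> gon na
        s = s.replaceAll("(?i)\\b(can)(not)\\b", "$1 $2");
        s = s.replaceAll("(?i)\\b(gon)(na)\\b", "$1 $2");
        // 'tis -> 't is (also the prefix of 'tisn't)
        s = s.replaceAll("(?i)(?<!\\w)('t)(is)(?=n't\\b|\\b)", "$1 $2");
        // Negative contraction: the token boundary falls before n't
        s = s.replaceAll("(?i)(n't)\\b", " $1");
        // Clitics and the genitive 's
        s = s.replaceAll("(?i)('s|'m|'d|'ll|'re|'ve)\\b", " $1");
        // Genitive of plural nouns: parents' -> parents '
        s = s.replaceAll("(?<=s)'(?=\\s|$)", " '");
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] samples = {"children's", "parents'", "won't", "cannot", "'tisn't", "gonna", "he'll"};
        for (String w : samples) {
            System.out.println(w + " -> " + String.join(" ", tokenize(w)));
        }
    }
}
```

The rules are applied in a fixed order so that 'tisn't is first split at 'tis and then at n't. A real tokenizer would need many more rules than this sketch shows.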

This tokenizer assumes that the text has already been segmented into sentences. Any period other than one at the end of the string or immediately before a newline is assumed to be part of the word it is attached to (for example, in an abbreviation) and is not tokenized separately.
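
The period rule described above can be illustrated in isolation. This is a sketch under stated assumptions rather than the library's code: the class SentenceFinalPeriodSketch and its tokenize method are hypothetical, and the sketch handles only the period rule (a period is detached when it ends the string or a line, and otherwise stays attached to its word).

```java
class SentenceFinalPeriodSketch {
    // Hypothetical helper: detaches only line-final and string-final periods.
    static String[] tokenize(String text) {
        // (?m) lets $ match before a line terminator as well as at the end of input;
        // the lookahead tolerates trailing spaces or tabs after the period.
        String s = text.replaceAll("(?m)\\.(?=[ \\t]*$)", " .");
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Periods inside "Mr." and "p.m" stay attached; only the final one is split off.
        System.out.println(String.join("|", tokenize("Mr. Smith arrived at 5 p.m.")));
    }
}
```

Because the rule looks only at position, a sentence-final abbreviation such as "p.m." loses its last period to a separate token. This is why the input is expected to be one sentence per string or per line.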

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static PennTreebankTokenizer | getInstance() | Returns the singleton instance. |
| String[] | split(String text) | Splits the string into a list of tokens. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.util.function.Function
andThen, compose

Methods inherited from interface smile.nlp.tokenizer.Tokenizer
apply
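
Because the class implements Function<String,String[]>, a tokenizer can be passed wherever such a function is expected and chained with andThen. The sketch below shows the pattern with a stand-in whitespace tokenizer so that it runs without the Smile library. The names TokenizerPipelineSketch, WHITESPACE, and lowercased are hypothetical. In real code, an instance obtained from PennTreebankTokenizer.getInstance() could presumably be passed in place of the stand-in.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class TokenizerPipelineSketch {
    // Stand-in tokenizer (an assumption for illustration): splits on whitespace only.
    static final Function<String, String[]> WHITESPACE = s -> s.trim().split("\\s+");

    // Chains any Function<String,String[]> tokenizer with a lower-casing step.
    static Function<String, List<String>> lowercased(Function<String, String[]> tokenizer) {
        return tokenizer.andThen(tokens ->
                Arrays.stream(tokens).map(String::toLowerCase).collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        Function<String, List<String>> pipeline = lowercased(WHITESPACE);
        System.out.println(pipeline.apply("The Quick Brown Fox"));
    }
}
```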

Method Details
- getInstance
  
  public static PennTreebankTokenizer getInstance()
  
  Returns the singleton instance.
  
  Returns:
  
  the singleton instance.
- split
  
  public String[] split(String text)
  
  Description copied from interface: Tokenizer
  
  Splits the string into a list of tokens.
  
  Specified by:
  
  split in interface Tokenizer
  
  Parameters:
  
  text - the text.
  
  Returns:
  
  the tokens.
