smile.nlp.tokenizer.SimpleTokenizer

All Implemented Interfaces: Function<String,String[]>, Tokenizer

public class SimpleTokenizer extends Object implements Tokenizer

A word tokenizer for English sentences. It differs from TreebankWordTokenizer in some respects, notably in how it handles contractions with "not". When a period both ends the sentence and belongs to an abbreviation (e.g. "etc." at the end of a sentence), this tokenizer generates the tokens "etc." and ".", whereas TreebankWordTokenizer generates "etc" and ".".

Most punctuation is split from adjoining words. Verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately. Examples:

children's -> children 's
parents' -> parents '
won't -> will not
can't -> can not
shan't -> shall not
cannot -> can not
weren't -> were not
'tisn't -> it is not
'tis -> it is
gonna -> gon na
I'm -> I 'm
he'll -> he 'll
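
The mappings above can be reproduced with a small lookup-plus-suffix routine. The following is a minimal sketch in plain Java, not Smile's implementation: the class name `ContractionSketch`, the method `splitWord`, and the exact suffix list are assumptions made for illustration, and the irregular forms are simply the ones listed in the examples.

```java
import java.util.Map;

public class ContractionSketch {
    // Irregular forms copied from the examples above; checked before the generic rules.
    private static final Map<String, String> IRREGULAR = Map.of(
            "won't", "will not",
            "can't", "can not",
            "shan't", "shall not",
            "cannot", "can not",
            "'tisn't", "it is not",
            "'tis", "it is",
            "gonna", "gon na");

    // Clitic suffixes (assumed list), longest first so "'ll" is tried before "'s".
    private static final String[] CLITICS = { "'ll", "'re", "'ve", "'s", "'m", "'d" };

    /** Splits one whitespace-delimited word into its component tokens. */
    public static String[] splitWord(String word) {
        String irregular = IRREGULAR.get(word.toLowerCase());
        if (irregular != null) {
            return irregular.split(" ");
        }
        // Regular not-contraction: weren't -> were not
        if (word.length() > 3 && word.endsWith("n't")) {
            return new String[] { word.substring(0, word.length() - 3), "not" };
        }
        // Clitics and singular genitive: children's -> children 's, I'm -> I 'm
        for (String suffix : CLITICS) {
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                String stem = word.substring(0, word.length() - suffix.length());
                return new String[] { stem, suffix };
            }
        }
        // Plural genitive: parents' -> parents '
        if (word.length() > 1 && word.endsWith("'")) {
            return new String[] { word.substring(0, word.length() - 1), "'" };
        }
        return new String[] { word };
    }

    public static void main(String[] args) {
        String[] samples = { "children's", "parents'", "won't", "weren't", "I'm", "he'll" };
        for (String w : samples) {
            System.out.println(w + " -> " + String.join(" ", splitWord(w)));
        }
    }
}
```

The sketch works on one word at a time and lowercases the replacement for irregular forms, so it does not preserve capitalization the way a full tokenizer might.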

This tokenizer assumes that the text has already been segmented into sentences. Any period, apart from one at the end of the string or before a newline, is assumed to be part of the word it is attached to (for example, in an abbreviation) and is not tokenized separately.
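
The period rule can be illustrated with a single regular expression. This is a hedged sketch in plain Java, not the library's code: `FinalPeriodSketch` and its `split` method are hypothetical names, and the sketch handles only the period rule and whitespace splitting, not other punctuation or the "etc." case described earlier.

```java
import java.util.regex.Pattern;

public class FinalPeriodSketch {
    // A period followed only by optional spaces or tabs, then a newline or the end of input.
    private static final Pattern FINAL_PERIOD =
            Pattern.compile("\\.(?=[ \\t]*(?:\\n|\\z))");

    /** Detaches sentence-final periods, then splits on whitespace. */
    public static String[] split(String text) {
        // Put a space before each final period so it becomes its own token.
        String spaced = FINAL_PERIOD.matcher(text).replaceAll(" .").trim();
        if (spaced.isEmpty()) {
            return new String[0];
        }
        return spaced.split("\\s+");
    }

    public static void main(String[] args) {
        // Periods inside "Dr.", "$3.50" and "J." stay attached; only the last one is split off.
        String[] tokens = split("Dr. Smith paid $3.50 to J. Doe.");
        System.out.println(String.join(" | ", tokens));
    }
}
```

Because every period that is not at the end of a line stays attached to its word, input that still contains several sentences on one line would keep the inner sentence-ending periods glued to the preceding word, which is why sentence segmentation is expected to happen first.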

Constructor Summary

Constructors

Constructor                                  Description
SimpleTokenizer()                            Creates a tokenizer with the default contraction handling.
SimpleTokenizer(boolean splitContraction)    Creates a tokenizer, optionally splitting contractions.

Method Summary

Modifier and Type    Method                Description
String[]             split(String text)    Splits the string into a list of tokens.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.util.function.Function
andThen, compose

Methods inherited from interface smile.nlp.tokenizer.Tokenizer
apply

Constructor Details
- SimpleTokenizer
  
  public SimpleTokenizer()
  
  Creates a tokenizer with the default contraction handling.
- SimpleTokenizer
  
  public SimpleTokenizer(boolean splitContraction)
  
  Creates a tokenizer, optionally splitting contractions.
  
  Parameters:
  
  splitContraction - if true, split contractions into their component words (e.g. won't -> will not).

Method Details
- split
  
  public String[] split(String text)
  
  Description copied from interface: Tokenizer
  
  Splits the string into a list of tokens.
  
  Specified by:
  
  split in interface Tokenizer
  
  Parameters:
  
  text - the text.
  
  Returns:
  
  the tokens.
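
Because the class implements Function<String,String[]>, a tokenizer can be chained with the inherited compose and andThen methods. The sketch below shows the idea under stated assumptions: `TokenizerPipelineSketch`, `pipeline`, and the `WHITESPACE` stand-in are hypothetical names, and the stand-in exists only so the example runs without Smile on the classpath. With Smile available, `new SimpleTokenizer()` could be passed in its place.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TokenizerPipelineSketch {
    /** Hypothetical stand-in tokenizer (whitespace split); not part of Smile. */
    static final Function<String, String[]> WHITESPACE = text -> text.trim().split("\\s+");

    /** Wraps any tokenizer in a normalize -> tokenize -> lowercase pipeline. */
    public static Function<String, List<String>> pipeline(Function<String, String[]> tokenizer) {
        // compose: runs before the tokenizer; replaces typographic apostrophes with ASCII ones.
        Function<String, String> normalize = s -> s.replace('\u2019', '\'');
        Function<String, String[]> tokenize = tokenizer.compose(normalize);
        // andThen: runs after the tokenizer; lowercases each token and collects into a list.
        return tokenize.andThen(tokens -> Arrays.stream(tokens)
                .map(String::toLowerCase)
                .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        Function<String, List<String>> p = pipeline(WHITESPACE);
        System.out.println(p.apply("Don\u2019t Panic"));
    }
}
```

Normalizing apostrophes before tokenizing matters for contraction handling in general, since rules keyed on the ASCII apostrophe would not match the typographic one.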
