Package smile.nlp.tokenizer
Class SimpleTokenizer
java.lang.Object
smile.nlp.tokenizer.SimpleTokenizer
A word tokenizer for English sentences. It differs from
TreebankWordTokenizer in a few ways, most notably in how it handles
contractions with "not". When a period both ends a sentence and belongs to
an abbreviation (e.g. "etc." at the end of a sentence), this tokenizer
generates the tokens "etc." and ".", whereas TreebankWordTokenizer generates
"etc" and ".".
Most punctuation is split from adjoining words. Verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately.
Examples:
- children's -> children 's
- parents' -> parents '
- won't -> will not
- can't -> can not
- shan't -> shall not
- cannot -> can not
- weren't -> were not
- 'tisn't -> it is not
- 'tis -> it is
- gonna -> gon na
- I'm -> I 'm
- he'll -> he 'll
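The rewrites listed above can be approximated with a short, ordered list of regular-expression rules. The following is a minimal, self-contained sketch and is not Smile's implementation: the class name ContractionSketch, the rule table, and the methods expand and tokenize are all hypothetical, and the sketch handles only lowercase 'tis forms and does not split other punctuation.

```java
/** Illustrative sketch only; NOT the code used by smile.nlp.tokenizer.SimpleTokenizer. */
final class ContractionSketch {
    // Ordered {regex, replacement} pairs (hypothetical rule table).
    // Irregular forms come first so the generic n't rule does not
    // turn "won't" into "wo not" or "can't" into "ca not".
    private static final String[][] RULES = {
        {"(?<!\\w)'tisn't\\b", "it is not"},   // 'tisn't -> it is not
        {"(?<!\\w)'tis\\b", "it is"},          // 'tis -> it is
        {"\\b([Ww])on't\\b", "$1ill not"},     // won't -> will not
        {"\\b([Cc])an't\\b", "$1an not"},      // can't -> can not
        {"\\b([Ss])han't\\b", "$1hall not"},   // shan't -> shall not
        {"\\b([Cc])annot\\b", "$1an not"},     // cannot -> can not
        {"\\b([Gg])onna\\b", "$1on na"},       // gonna -> gon na
        {"(\\w)n't\\b", "$1 not"},             // weren't -> were not
        {"(\\w)'(s|m|ll)\\b", "$1 '$2"},       // children's, I'm, he'll
        {"s'(?!\\w)", "s '"},                  // parents' -> parents '
    };

    /** Applies every rule in order and returns the rewritten text. */
    static String expand(String text) {
        for (String[] rule : RULES) {
            text = text.replaceAll(rule[0], rule[1]);
        }
        return text;
    }

    /** Rewrites contractions, then splits on whitespace. */
    static String[] tokenize(String text) {
        return expand(text).trim().split("\\s+");
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("We won't take the children's toys");
        // prints: We | will | not | take | the | children | 's | toys
        System.out.println(String.join(" | ", tokens));
    }
}
```

Rule order is the main design choice here: specific irregular contractions must be rewritten before the generic n't rule runs.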
Constructor Details
SimpleTokenizer
public SimpleTokenizer()
Constructor.
SimpleTokenizer
public SimpleTokenizer(boolean splitContraction)
Constructor.
Parameters:
splitContraction - if true, split contractions into separate words (e.g. won't -> will not).
Method Details