Package smile.nlp.tokenizer


Sentence splitter and word tokenizer.
  • Class
    Description
    A sentence splitter based on java.text.BreakIterator, which supports multiple natural languages (selected by the locale setting).
    A word tokenizer based on java.text.BreakIterator, which supports multiple natural languages (selected by the locale setting).
    A paragraph splitter segments text into paragraphs.
    A word tokenizer that tokenizes English sentences using the conventions used by the Penn Treebank.
    A sentence splitter segments text into sentences (a string of words satisfying the grammatical rules of a language).
    A simple paragraph splitter.
    A simple sentence splitter for English.
    A word tokenizer that tokenizes English sentences with some differences from TreebankWordTokenizer, notably in its handling of not-contractions.
    A token is a string of characters that is categorized as a symbol according to a set of rules.
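
Two of the entries above are described as built on java.text.BreakIterator with the language chosen by locale. The sketch below illustrates that underlying JDK mechanism only; it is not this package's implementation. The class name BreakIteratorSketch and the methods sentences, words, and split are hypothetical names introduced for illustration, and dropping whitespace-only segments is an assumption of this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of smile.nlp.tokenizer.
public class BreakIteratorSketch {

    /** Splits text into sentences using the sentence rules of the given locale. */
    public static List<String> sentences(String text, Locale locale) {
        return split(BreakIterator.getSentenceInstance(locale), text);
    }

    /** Splits text into word-level segments using the word rules of the given locale. */
    public static List<String> words(String text, Locale locale) {
        return split(BreakIterator.getWordInstance(locale), text);
    }

    // Walks consecutive boundaries and collects the text between them.
    // Assumption of this sketch: segments that are only whitespace are dropped.
    private static List<String> split(BreakIterator it, String text) {
        List<String> parts = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String part = text.substring(start, end).trim();
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "Smile is fast. It has many algorithms.";
        System.out.println(sentences(text, Locale.ENGLISH));
        System.out.println(words("Hello world", Locale.ENGLISH));
    }
}
```

Passing a different Locale selects that language's boundary rules, which is the "selected by locale setting" behavior mentioned above. Note that the word iterator reports punctuation as separate segments, so a real tokenizer would typically filter or post-process them.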