Class SimpleSentenceSplitter

java.lang.Object
smile.nlp.tokenizer.SimpleSentenceSplitter
All Implemented Interfaces:
SentenceSplitter

public class SimpleSentenceSplitter extends Object implements SentenceSplitter
This is a simple sentence splitter for English. Given a string, assumed to be English text, it returns a list of strings, where each element is an English sentence. By default, it treats occurrences of '.', '?' and '!' as sentence delimiters, but does its best to determine when an occurrence of '.' does not have this role (e.g. in abbreviations, URLs, numbers, etc.).

Recognizing the end of a sentence is not an easy task for a computer. In English, punctuation marks that usually appear at the end of a sentence may not indicate the end of a sentence. The period is the worst offender. A period can end a sentence, but it can also be part of an abbreviation or acronym, an ellipsis, a decimal number, or part of a bracket of periods surrounding a Roman numeral. A period can even act as both the end of an abbreviation and the end of a sentence at the same time. On the other hand, some poems may not contain any sentence punctuation at all.
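The ambiguity of the period can be illustrated with a small, self-contained sketch. It is not the implementation used by this class; the class name PeriodAmbiguityDemo, both method names, and the tiny abbreviation list are assumptions made up for this example. It compares a naive rule (break after every '.', '?' or '!' followed by whitespace) with a rule that refuses to break after a known abbreviation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; NOT the Smile implementation.
class PeriodAmbiguityDemo {
    // Hypothetical, deliberately tiny abbreviation list (assumption for this example).
    static final Set<String> ABBREVIATIONS =
            Set.of("Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "etc.");

    // Naive rule: every '.', '?' or '!' followed by whitespace ends a sentence.
    static List<String> naiveSplit(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.?!])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Slightly better rule: a token ending in a delimiter closes the sentence
    // unless the whole token is a known abbreviation.
    static List<String> abbreviationAwareSplit(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            boolean endsWithDelimiter =
                    token.endsWith(".") || token.endsWith("?") || token.endsWith("!");
            if (endsWithDelimiter && !ABBREVIATIONS.contains(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep trailing text that has no closing punctuation.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }
}
```

For the input "Dr. Smith paid 3.50 dollars. Was it worth it? Yes!", naiveSplit returns four pieces, the first being just "Dr.", while abbreviationAwareSplit returns three sentences. The sketch still fails when an abbreviation such as "etc." also ends a sentence, which is exactly the double role of the period described above.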

Another problem punctuation mark is the single quote, which can introduce a quote or start a contraction such as 'tis. Leading-quote contractions are uncommon in contemporary English texts, but appear frequently in Early Modern English texts.

This splitter assumes that the text has already been segmented into paragraphs. Any carriage returns are replaced by whitespace.
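Because paragraph segmentation is left to the caller, a preprocessing step along the following lines may be useful. This is a minimal sketch under stated assumptions: the class ParagraphPreprocessor and its methods are hypothetical helpers, not part of this library, and paragraphs are assumed to be separated by blank lines. The sentence splitter is passed in as a plain function so the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of the library.
class ParagraphPreprocessor {
    // Splits raw text on blank lines (assumed paragraph separator) and
    // collapses line breaks and repeated whitespace inside each paragraph.
    static List<String> toParagraphs(String raw) {
        List<String> paragraphs = new ArrayList<>();
        // Normalize Windows (\r\n) and bare carriage-return line endings.
        String normalized = raw.replace("\r\n", "\n").replace('\r', '\n');
        for (String block : normalized.split("\\n\\s*\\n")) {
            String paragraph = block.replaceAll("\\s+", " ").trim();
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }

    // Applies a sentence splitter to each paragraph and concatenates the results.
    static List<String> splitAll(String raw, Function<String, String[]> splitter) {
        List<String> sentences = new ArrayList<>();
        for (String paragraph : toParagraphs(raw)) {
            sentences.addAll(Arrays.asList(splitter.apply(paragraph)));
        }
        return sentences;
    }
}
```

Given the signatures documented on this page, a method reference such as SimpleSentenceSplitter.getInstance()::split should fit the splitter parameter, since split takes a String and returns a String[].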

References

  1. Paul Clough. A Perl program for sentence splitting using rules.
  • Method Details

    • getInstance

      public static SimpleSentenceSplitter getInstance()
      Returns the singleton instance.
      Returns:
      the singleton instance.
    • split

      public String[] split(String text)
      Description copied from interface: SentenceSplitter
      Splits the text into sentences.
      Specified by:
      split in interface SentenceSplitter
      Parameters:
      text - the text.
      Returns:
      the sentences.