words

fun String.words(filter: String = "default"): Array<String>

Tokenizes English sentences, with some differences from TreebankWordTokenizer, notably in the handling of not-contractions. If a period serves as both the end of a sentence and part of an abbreviation, e.g. "etc." at the end of a sentence, this tokenizer generates the tokens "etc." and ".", whereas TreebankWordTokenizer generates "etc" and ".".

Most punctuation is split from adjoining words. Verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately.
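As a rough illustration of splitting contractions and genitives, the sketch below separates a few common clitics ('s, 're, 've, 'll, 'd, 'm) from the word they attach to. The function name splitClitics and the regular expression are hypothetical and are not taken from the library. Not-contractions such as "aren't" are deliberately left untouched, because their handling is not specified here.

```kotlin
// Hypothetical sketch: detach common clitics from the preceding word.
// This is not the library's implementation, only an illustration of the idea.
val cliticPattern = Regex("(?i)(\\w)('(?:s|re|ve|ll|d|m))\\b")

fun splitClitics(sentence: String): List<String> =
    sentence
        // Insert a space between the word and its clitic: "She's" -> "She 's".
        .replace(cliticPattern, "$1 $2")
        // Split on runs of whitespace and drop any empty pieces.
        .split(Regex("\\s+"))
        .filter { it.isNotEmpty() }
```

Under these assumptions, splitClitics("She's sure the children's toys aren't here") yields She, 's, sure, the, children, 's, toys, aren't, here.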

This tokenizer assumes that the text has already been segmented into sentences. Any periods, apart from those at the end of a string or before a newline, are assumed to be part of the word they are attached to (e.g. in abbreviations) and are not tokenized separately.
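The period rule can be sketched with a toy whitespace tokenizer. Everything in this example is an assumption made for illustration: the function splitPeriods, the small abbreviations set, and the choice to split only on whitespace (so commas and other punctuation stay attached, unlike in the real tokenizer). It shows only that a period is separated at the end of a string or line, and that a known abbreviation in that position keeps its own period.

```kotlin
// Hypothetical abbreviation list, for illustration only.
val abbreviations = setOf("etc.", "e.g.", "i.e.", "Mr.", "Dr.")

// Toy sketch: periods stay attached to their word unless the word is the
// last one on a line (or in the string), where "." becomes its own token.
fun splitPeriods(text: String): List<String> {
    val tokens = mutableListOf<String>()
    for (line in text.split('\n')) {
        val words = line.trim().split(Regex("\\s+")).filter { it.isNotEmpty() }
        for ((i, word) in words.withIndex()) {
            val isLast = i == words.lastIndex
            if (isLast && word.length > 1 && word.endsWith(".")) {
                // A trailing abbreviation keeps its period; other words lose it.
                tokens += if (word in abbreviations) word else word.dropLast(1)
                tokens += "."
            } else {
                tokens += word
            }
        }
    }
    return tokens
}
```

With this sketch, splitPeriods("Mr. Smith likes apples, pears, etc.") keeps "Mr." intact in mid-sentence and ends with the two tokens "etc." and ".".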

If the parameter filter is not "none", the method also filters out stop words and punctuation. There is no definitive list of stop words that all tools share. Valid values of the parameter filter are:

  • "none": no filtering

  • "default": the default English stop word list

  • "comprehensive": a more comprehensive English stop word list

  • "google": the stop word list used by the Google search engine

  • "mysql": the stop word list used by the MySQL full-text search feature

  • custom stop word list: a comma-separated list of stop words
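The way a filter value might map to a stop word set can be sketched as follows. This is a minimal illustration, not the library's code: resolveStopWords and filterTokens are hypothetical helpers, the built-in lists are passed in as a namedLists map because their contents are not reproduced here, and the case-insensitive matching is an assumption.

```kotlin
// Hypothetical helper: resolve a filter value to a stop word set.
// `namedLists` stands in for the built-in lists ("default", "google", ...).
fun resolveStopWords(filter: String, namedLists: Map<String, Set<String>>): Set<String>? =
    when {
        filter == "none" -> null                        // no filtering at all
        filter in namedLists -> namedLists.getValue(filter)
        else -> filter.split(",")                       // custom comma-separated list
            .map { it.trim().lowercase() }
            .filter { it.isNotEmpty() }
            .toSet()
    }

// Drop stop words (assumed case-insensitive) and punctuation-only tokens.
// A null stop word set means "no filtering", so tokens pass through unchanged.
fun filterTokens(tokens: List<String>, stopWords: Set<String>?): List<String> {
    if (stopWords == null) return tokens
    return tokens.filter { token ->
        token.lowercase() !in stopWords && token.any { it.isLetterOrDigit() }
    }
}
```

For example, given a stand-in "default" list of the, is and on, the tokens The, cat, is, on, the, mat and "." reduce to cat and mat, while the filter value "none" returns them unchanged.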