Package smile.nlp

Natural language processing.

Functions

bag
fun String.bag(filter: String = "default", stemmer: Stemmer? = porter): Map<String, Int>

Returns the bag of words. The bag-of-words model is a simple representation of text as the bag of its words, disregarding grammar and word order but keeping multiplicity.
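For illustration, a minimal call based on the signature above; the sample text is made up, and the exact keys depend on the stop-word filter and stemmer:

import smile.nlp.*

// Tokenize, drop stop words, stem with Porter's stemmer (the defaults),
// and count term frequencies.
val counts: Map<String, Int> = "The cat sat on the mat. The cat slept.".bag()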

bag2
fun String.bag2(filter: String = "default", stemmer: Stemmer? = porter): Set<String>

Returns the binary bag of words. Presence/absence is used instead of frequencies.
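A sketch of the binary variant with an illustrative input:

import smile.nlp.*

// Records only which terms occur, not how often.
val present: Set<String> = "The cat sat on the mat.".bag2()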

bigram
fun bigram(p: Double, minFreq: Int, text: List<String>): Array<Bigram>

Identifies bigram collocations whose p-value is below the given threshold.

fun bigram(k: Int, minFreq: Int, text: List<String>): Array<Bigram>

Identifies bigram collocations (words that often appear consecutively) within corpora. They may also be used to find other associations between word occurrences.
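A usage sketch of both overloads; the two-document corpus is only illustrative:

import smile.nlp.*

val docs = listOf(
    "Labor Day is a federal holiday in the United States.",
    "The United States is a country in North America."
)
// Top 10 bigram collocations that occur at least twice.
val top = bigram(10, 2, docs)
// Or keep only collocations whose p-value is below 0.01.
val significant = bigram(0.01, 2, docs)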

corpus
fun corpus(text: List<String>): SimpleCorpus

Creates an in-memory text corpus.

df
fun df(terms: List<String>, corpus: List<Map<String, Int>>): IntArray

Returns the document frequencies, i.e. the number of documents that contain each of the given terms.
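For example, with stemming disabled so the query terms match the bag keys (a tiny, made-up corpus):

import smile.nlp.*

val corpus = listOf(
    "big data is big".bag(stemmer = null),
    "machine learning".bag(stemmer = null)
)
// Number of documents containing each query term, e.g. [1, 1] here.
val freqs: IntArray = df(listOf("big", "learning"), corpus)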

keywords
fun String.keywords(k: Int = 10): Array<NGram>

Extracts keywords from a single document using word co-occurrence statistical information.
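A minimal call; the document text is a placeholder:

import smile.nlp.*

// Top 10 keyword n-grams of the document (k defaults to 10).
val top = "Replace this with the full text of a document.".keywords(10)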

lancaster
fun lancaster(word: String): String

The Paice/Husk Lancaster stemming algorithm, a conflation-based iterative stemmer. Although efficient and easy to implement, it is known to be very strong and aggressive. The stemmer uses a single table of rules, each of which may specify the removal or replacement of an ending.
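For illustration (the word is arbitrary; the exact stem is determined by the rule table):

import smile.nlp.*

// Apply the Lancaster stemmer to a single word.
val stem = lancaster("maximum")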

ngram
fun ngram(maxNGramSize: Int, minFreq: Int, text: List<String>): Array<Array<NGram>>

An Apriori-like algorithm to extract n-gram phrases.
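A usage sketch with a made-up two-document corpus:

import smile.nlp.*

val docs = listOf(
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox is quick."
)
// n-gram phrases (up to 4 words) occurring at least twice.
val phrases = ngram(4, 2, docs)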

normalize
fun String.normalize(): String

Normalizes Unicode text.

porter
fun porter(word: String): String

Porter's stemming algorithm. The stemmer is based on the idea that most English suffixes are combinations of smaller and simpler suffixes. It is a linear-step stemmer, applying five steps of rules in sequence. Within a step, if a suffix rule matches a word, the conditions attached to that rule are tested on the stem that would result from removing the suffix in the way the rule defines. Once a rule's conditions are satisfied, the rule fires, the suffix is removed, and control moves to the next step. Otherwise the next rule in the step is tested, until either a rule fires or the step runs out of rules; in both cases control then moves to the next step.
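For example (the words are arbitrary, and the exact stems depend on the implementation):

import smile.nlp.*

// Related word forms typically reduce to a common stem.
val a = porter("relational")
val b = porter("relation")
// With the standard Porter rules both become "relat".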

postag
fun String.postag(): Array<PennTreebankPOS>

Returns the (word, part-of-speech) pairs. The text should be a single sentence.

fun postag(sentence: Array<String>): Array<PennTreebankPOS>

Returns the part-of-speech tags of a tokenized sentence.
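A sketch of both forms, using a made-up sentence:

import smile.nlp.*

// Tag a single sentence; the tags use the Penn Treebank tag set.
val tags = "The quick brown fox jumps over the lazy dog.".postag()

// Or tag an already tokenized sentence.
val tokenTags = postag(arrayOf("The", "quick", "brown", "fox", "jumps"))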

sentences
fun String.sentences(): Array<String>

Splits English text into sentences. Given an English text, it returns a list of strings, where each element is an English sentence. By default, it treats occurrences of '.', '?' and '!' as sentence delimiters, but does its best to determine when an occurrence of '.' does not have this role (e.g. in abbreviations, URLs, numbers, etc.).
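For illustration (the text is made up; the splitter does its best with the periods in "Dr." and "a.m."):

import smile.nlp.*

val text = "Dr. Smith arrived at 9 a.m. She gave a talk. Did it go well? Yes!"
val sents = text.sentences()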

tfidf
fun tfidf(corpus: List<DoubleArray>): List<DoubleArray>

Converts a corpus to TF-IDF feature vectors, which are normalized to L2 norm 1.

fun tfidf(bag: DoubleArray, n: Int, df: IntArray): DoubleArray

Converts a bag of words to a feature vector by TF-IDF, which is normalized to L2 norm 1.

fun tfidf(tf: Double, maxtf: Double, n: Int, df: Int): Double

TF-IDF relevance score between a term and a document based on a corpus.
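A sketch of a typical pipeline built from bag, vectorize, and tfidf; the three documents are made up:

import smile.nlp.*

val docs = listOf(
    "Jaguar is a big cat, a predator.",
    "The Jaguar car is fast.",
    "Big cats hunt at night."
)
// Bags of words per document, a shared vocabulary, and raw count vectors.
val bags = docs.map { it.bag() }
val terms = bags.flatMap { it.keys }.distinct().toTypedArray()
val vectors = bags.map { vectorize(terms, it) }
// Normalized TF-IDF feature vectors (L2 norm 1), one per document.
val features = tfidf(vectors)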

vectorize
fun vectorize(terms: Array<String>, bag: Map<String, Int>): DoubleArray

Converts a bag of words to a feature vector.

fun vectorize(terms: List<String>, bag: Set<String>): IntArray

Converts a binary bag of words to a sparse feature vector.
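For the binary overload, a minimal sketch with made-up inputs (the query terms are chosen so they survive the default stemming):

import smile.nlp.*

val terms = listOf("big", "cat", "jaguar")
val bag = "Jaguar is a big cat.".bag2()
// Sparse feature vector over the given term list.
val sparse: IntArray = vectorize(terms, bag)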

words
fun String.words(filter: String = "default"): Array<String>

Tokenizes English sentences with some differences from TreebankWordTokenizer, notably on handling not-contractions. If a period serves as both the end of a sentence and part of an abbreviation, e.g. "etc." at the end of a sentence, this tokenizer generates the tokens "etc." and ".", while TreebankWordTokenizer generates "etc" and ".".
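For illustration (a made-up sentence; the filter argument selects the stop-word list, "default" here):

import smile.nlp.*

val tokens = "There is a good market for the coconut oil, etc.".words()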