smile.nlp

package smile.nlp

Natural language processing.


Members list

Value members

Concrete methods

def bigram(k: Int, minFreq: Int, text: String*): Array[Bigram]

Identify bigram collocations (words that often appear consecutively) within corpora. They may also be used to find other associations between word occurrences.

Finding collocations requires first calculating the frequencies of words and their appearance in the context of other words. Often the collection of words will then require filtering to only retain useful content terms. Each n-gram of words may then be scored according to some association measure, in order to determine the relative likelihood of each n-gram being a collocation.

Value parameters

k

the number of top bigrams to return.

minFreq

the minimum frequency of a collocation.

text

input text.


Returns

significant bigram collocations in descending order of likelihood ratio.
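
For example, a minimal sketch of finding the top collocations (the sample text below is hypothetical; any free-form English text works):

  import smile.nlp._

  val text = "I like Java. I like Scala. I like Scala more than Java."
  // Top 10 bigrams that occur at least twice in the input.
  val collocations = bigram(10, 2, text)
  collocations.foreach(println)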

def bigram(p: Double, minFreq: Int, text: String*): Array[Bigram]

Identify bigram collocations whose p-value is less than the given threshold.

Value parameters

minFreq

the minimum frequency of a collocation.

p

the p-value threshold.

text

input text.


Returns

significant bigram collocations in descending order of likelihood ratio.
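
A similar sketch using a significance threshold instead of a fixed count (reusing the hypothetical text from the previous sketch):

  // Keep bigrams whose p-value is below 0.0001 and that occur at least twice.
  val significant = bigram(0.0001, 2, text)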

def corpus(text: Seq[String]): SimpleCorpus

Creates an in-memory text corpus.

Value parameters

text

a set of text documents.

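A small sketch of building a corpus from in-memory strings (the documents are hypothetical):

  import smile.nlp._

  val docs = Seq("The quick brown fox.", "The lazy dog sleeps.")
  val c = corpus(docs)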

def df(terms: Array[String], corpus: Array[Map[String, Int]]): Array[Int]

Returns the document frequencies, i.e. the number of documents that contain each term.

Value parameters

corpus

the training corpus.

terms

the token list used as features.


Returns

the array of document frequencies.
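
For example, a sketch over a tiny hypothetical bag-of-words corpus:

  import smile.nlp._

  val docs = Array(Map("cat" -> 2, "dog" -> 1), Map("dog" -> 3))
  val freq = df(Array("cat", "dog"), docs)
  // freq is Array(1, 2): "cat" appears in one document, "dog" in two.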

def ngram(maxNGramSize: Int, minFreq: Int, text: String*): Array[Array[NGram]]

An Apriori-like algorithm to extract n-gram phrases.

Value parameters

maxNGramSize

The maximum length of an n-gram.

minFreq

The minimum frequency of an n-gram in the sentences.

text

input text.


Returns

An array of sets of n-grams. The i-th entry is the set of i-grams.
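
A short sketch (the input string is hypothetical):

  import smile.nlp._

  val sentences = "I like Scala. I like Scala a lot."
  val phrases = ngram(4, 2, sentences)
  // phrases(2) holds the extracted 2-gram phrases, phrases(3) the 3-grams, and so on.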

def postag(sentence: Array[String]): Array[PennTreebankPOS]

Part-of-speech tagging. Tags each word of the given sentence with its Penn Treebank POS tag.

Value parameters

sentence

a sentence that is already segmented to words.


Returns

the POS tags.
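
For example (the tags shown are illustrative; the actual output depends on the trained model):

  import smile.nlp._

  val tags = postag(Array("The", "dog", "chased", "the", "cat"))
  // Likely Array(DT, NN, VBD, DT, NN).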

def tfidf(corpus: Seq[Array[Double]]): Array[Array[Double]]

Converts a corpus to TF-IDF feature vectors, which are normalized to L2 norm 1.

Value parameters

corpus

the corpus of documents in bag-of-words representation.


Returns

a matrix whose rows are the TF-IDF feature vectors.
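
A sketch over a hypothetical two-document, three-term corpus of raw term counts:

  import smile.nlp._

  val tf = Seq(Array(1.0, 0.0, 2.0), Array(0.0, 3.0, 1.0))
  val features = tfidf(tf)
  // Each row of features is the L2-normalized TF-IDF vector of one document.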

def tfidf(corpus: Array[Array[Double]]): Array[Array[Double]]

Converts a corpus to TF-IDF feature vectors, which are normalized to L2 norm 1.

Value parameters

corpus

the corpus of documents in bag-of-words representation.


Returns

a matrix whose rows are the TF-IDF feature vectors.

def tfidf(bag: Array[Double], n: Int, df: Array[Int]): Array[Double]

Converts a bag of words to a feature vector by TF-IDF, which is normalized to L2 norm 1.

Value parameters

bag

the bag-of-words feature vector of a document.

df

the number of documents containing each term in the corpus.

n

the number of documents in the training corpus.


Returns

the TF-IDF feature vector.
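
A sketch with made-up corpus statistics (100 documents; per-term document frequencies 30, 5, and 60):

  import smile.nlp._

  val v = tfidf(Array(1.0, 0.0, 2.0), 100, Array(30, 5, 60))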

def vectorize(terms: Array[String], bag: Map[String, Int]): Array[Double]

Converts a bag of words to a feature vector.

Value parameters

bag

the bag of words.

terms

the token list used as features.


Returns

a vector of the frequencies of the feature tokens in the bag.
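
For example:

  import smile.nlp._

  val terms = Array("cat", "dog", "fish")
  val v = vectorize(terms, Map("dog" -> 2, "cat" -> 1))
  // v is Array(1.0, 2.0, 0.0), following the order of terms.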

def vectorize(terms: Array[String], bag: Set[String]): Array[Int]

Converts a binary bag of words to a sparse feature vector.

Value parameters

bag

the bag of words.

terms

the token list used as features.


Returns

an integer vector whose elements are the indices of the feature tokens present in the bag, in ascending order.
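
For example:

  import smile.nlp._

  val idx = vectorize(Array("cat", "dog", "fish"), Set("fish", "cat"))
  // idx is Array(0, 2): the indices of "cat" and "fish" in ascending order.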

Concrete fields

val lancaster: LancasterStemmer

The Paice/Husk Lancaster stemming algorithm. The stemmer is a conflation-based iterative stemmer. Although efficient and easy to implement, it is known to be very strong and aggressive. The stemmer uses a single table of rules, each of which may specify the removal or replacement of an ending.


val porter: PorterStemmer

Porter's stemming algorithm. The stemmer is based on the idea that suffixes in English are mostly combinations of smaller and simpler suffixes. It is a linear-step stemmer: it has five steps, each applying rules in turn. Within a step, if a suffix rule matches the word, the conditions attached to that rule are tested on the stem that would result from removing the suffix as defined by the rule. Once a rule passes its conditions, it fires: the suffix is removed and control moves to the next step. If the rule is not accepted, the next rule in the step is tested, until either a rule fires or the step runs out of rules; in both cases control then moves to the next step.
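
Both stemmers can be applied via their stem method; a short sketch (the outputs in the comments are what these algorithms typically produce):

  import smile.nlp._

  porter.stem("stemming")     // likely "stem"
  lancaster.stem("stemming")  // likely "stem"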


Implicits

implicit def pimpString(string: String): PimpedString
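
Enriches a plain String with NLP operations via PimpedString. A hedged sketch (the sentences helper is assumed from Smile's Scala API; treat the exact method set as an assumption):

  import smile.nlp._

  // Implicitly converted to PimpedString, which provides NLP helpers.
  val sents = "Hello there. How are you?".sentences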