Record Class NGram
java.lang.Object
java.lang.Record
smile.nlp.NGram
- Record Components:
words - the word sequence.
count - the total number of occurrences of the n-gram in the corpus.
- All Implemented Interfaces:
Comparable<NGram>
An n-gram is a contiguous sequence of n words from a given sequence of text.
An n-gram of size 1 is referred to as a unigram; size 2 is a bigram;
size 3 is a trigram.
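As an illustration of the definition above, the following minimal sketch slides a window of n words over a tokenized sentence. The class `NGramWindows` and its `ngrams` helper are hypothetical names introduced for this example; they are not part of the Smile API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramWindows {
    /** Returns every contiguous run of n words, in order of appearance. */
    public static List<String[]> ngrams(String[] words, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String[]> result = new ArrayList<>();
        // The last valid start index is words.length - n.
        for (int i = 0; i + n <= words.length; i++) {
            result.add(Arrays.copyOfRange(words, i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] sentence = {"the", "quick", "brown", "fox"};
        // Print unigrams, bigrams, and trigrams of the sample sentence.
        for (int n = 1; n <= 3; n++) {
            for (String[] gram : ngrams(sentence, n)) {
                System.out.println(n + "-gram: " + String.join(" ", gram));
            }
        }
    }
}
```

A sentence of k words yields k - n + 1 n-grams when k >= n, and none otherwise.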
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type | Method | Description
static NGram[][] | apriori(Collection<String[]> sentences, int maxNGramSize, int minFrequency) | Extracts n-gram phrases by an Apriori-like algorithm.
int | compareTo(NGram) |
int | count() | Returns the value of the count record component.
boolean | equals(Object) | Indicates whether some other object is "equal to" this one.
int | hashCode() | Returns a hash code value for this object.
String | toString() | Returns a string representation of this record class.
String[] | words() | Returns the value of the words record component.
-
Constructor Details
-
NGram
-
NGram
-
-
Method Details
-
toString
-
hashCode
-
equals
Indicates whether some other object is "equal to" this one. The objects are equal if the other object is of the same class and if all the record components are equal. Reference components are compared with Objects::equals(Object,Object); primitive components are compared with the compare method from their corresponding wrapper classes.
-
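The comparison rule described for equals has a consequence for array-typed components: Objects::equals on two arrays compares references, not contents. The sketch below shows this with a hypothetical stand-in record `Gram` that has the same component types as NGram and relies on the default record equals. It is not smile.nlp.NGram itself, whose actual behavior may differ if the class overrides equals. The `sameContent` helper is likewise an assumption made for this example.

```java
import java.util.Arrays;

public class RecordEqualityDemo {
    // Hypothetical stand-in with the same component types as NGram.
    // It uses the compiler-generated equals, which applies Objects::equals
    // to the array component and therefore compares array references.
    public record Gram(String[] words, int count) {}

    /** Content-based comparison, for contrast with the default equals. */
    public static boolean sameContent(Gram a, Gram b) {
        return a.count() == b.count() && Arrays.equals(a.words(), b.words());
    }

    public static void main(String[] args) {
        String[] shared = {"new", "york"};
        Gram a = new Gram(shared, 3);
        Gram b = new Gram(shared, 3);                       // same array reference
        Gram c = new Gram(new String[]{"new", "york"}, 3);  // equal content, different array

        System.out.println(a.equals(b));        // true
        System.out.println(a.equals(c));        // false
        System.out.println(sameContent(a, c));  // true
    }
}
```

When n-grams are used as keys in hash-based collections, it is worth checking which of the two behaviors the concrete class provides.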
compareTo
- Specified by:
compareTo in interface Comparable<NGram>
-
apriori
Extracts n-gram phrases by an Apriori-like algorithm. The algorithm was proposed in "A Study Using n-gram Features for Text Categorization" by Johannes Fürnkranz. The algorithm takes a collection of sentences and generates all n-grams of length at most maxNGramSize that occur at least minFrequency times in the sentences.
- Parameters:
sentences - A collection of sentences (already split).
maxNGramSize - The maximum length of n-gram.
minFrequency - The minimum frequency of n-gram in the sentences.
- Returns:
An array of n-gram sets. The i-th entry is the set of i-grams.
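The level-wise idea behind an Apriori-like extraction can be sketched in plain Java as follows. This is an illustrative approximation under stated assumptions, not Smile's implementation: the class `AprioriNGramSketch`, the method `frequentNGrams`, and the choice to represent each n-gram as a space-joined string (which assumes words contain no spaces) are all hypothetical. The returned list mirrors the documented return shape in that entry i holds the frequent i-grams, with entry 0 left empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AprioriNGramSketch {
    /**
     * Returns a list whose i-th entry maps each frequent i-gram
     * (words joined by a single space) to its number of occurrences.
     */
    public static List<Map<String, Integer>> frequentNGrams(
            Collection<String[]> sentences, int maxNGramSize, int minFrequency) {
        List<Map<String, Integer>> levels = new ArrayList<>();
        levels.add(new HashMap<>()); // entry 0 is unused

        for (int n = 1; n <= maxNGramSize; n++) {
            Map<String, Integer> previous = levels.get(n - 1);
            Map<String, Integer> counts = new HashMap<>();
            for (String[] sentence : sentences) {
                for (int i = 0; i + n <= sentence.length; i++) {
                    // Apriori pruning: an n-gram can only be frequent if both
                    // of its (n-1)-word sub-sequences are frequent.
                    if (n > 1
                            && !(previous.containsKey(join(sentence, i, n - 1))
                                 && previous.containsKey(join(sentence, i + 1, n - 1)))) {
                        continue;
                    }
                    counts.merge(join(sentence, i, n), 1, Integer::sum);
                }
            }
            // Keep only the n-grams that reach the frequency threshold.
            counts.values().removeIf(c -> c < minFrequency);
            levels.add(counts);
        }
        return levels;
    }

    private static String join(String[] words, int start, int length) {
        return String.join(" ", Arrays.copyOfRange(words, start, start + length));
    }

    public static void main(String[] args) {
        List<String[]> sentences = List.of(
                new String[]{"new", "york", "city"},
                new String[]{"new", "york", "times"},
                new String[]{"in", "new", "york"});
        List<Map<String, Integer>> result = frequentNGrams(sentences, 3, 2);
        for (int n = 1; n < result.size(); n++) {
            System.out.println(n + "-grams: " + result.get(n));
        }
    }
}
```

With the sample sentences and minFrequency set to 2, only "new" and "york" survive as unigrams, "new york" is the only frequent bigram, and no trigram is counted because each candidate contains an infrequent bigram.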
-
words
-
count
-