Interface CooccurrenceKeywords


public interface CooccurrenceKeywords
Keyword extraction from a single document using word co-occurrence statistical information. The algorithm was proposed by Y. Matsuo and M. Ishizuka. It consists of six steps:
  1. Stem words with the Porter algorithm and extract phrases based on the Apriori algorithm (up to 4 words, occurring more than 3 times). Discard stop words.
  2. Select the most frequent terms, up to 30% of the running terms.
  3. Cluster the frequent terms. Two terms are in the same cluster if either their Jensen-Shannon divergence or their mutual information is above the threshold (0.95 * log 2 and log 2, respectively).
  4. Calculate the expected co-occurrence probability of each cluster.
  5. Calculate the refined χ² value of each term, which subtracts the contribution of the maximal term.
  6. Output a given number of terms with the largest refined χ² values.
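Steps 4 and 5 can be illustrated with a minimal sketch. It is not the library's implementation; the class name RefinedChiSquareSketch, its method names, and the input arrays are hypothetical and chosen only for illustration. The sketch assumes the usual definition of the statistic: for a term w with n_w co-occurrences in total, the expected count with cluster g is n_w * p_g, the χ² value sums (freq(w, g) - n_w * p_g)² / (n_w * p_g) over all clusters, and the refined value subtracts the single largest summand.

```java
/**
 * Illustrative sketch of the chi-square and refined chi-square scores.
 * All names here are hypothetical; this is not the library's code.
 */
final class RefinedChiSquareSketch {
    private RefinedChiSquareSketch() {}

    /** Per-cluster summands (freq[g] - nw * p[g])^2 / (nw * p[g]). */
    static double[] summands(int[] freq, double[] p, double nw) {
        if (freq.length != p.length) {
            throw new IllegalArgumentException("freq and p must have the same length");
        }
        double[] terms = new double[freq.length];
        for (int g = 0; g < freq.length; g++) {
            double expected = nw * p[g];
            // Assumption: clusters with zero expected count are skipped.
            if (expected > 0.0) {
                double diff = freq[g] - expected;
                terms[g] = diff * diff / expected;
            }
        }
        return terms;
    }

    /** Plain chi-square: the sum of all summands. */
    static double chiSquare(int[] freq, double[] p, double nw) {
        double sum = 0.0;
        for (double t : summands(freq, p, nw)) {
            sum += t;
        }
        return sum;
    }

    /** Refined chi-square: the sum of all summands minus the largest one. */
    static double refinedChiSquare(int[] freq, double[] p, double nw) {
        double sum = 0.0;
        double max = 0.0;
        for (double t : summands(freq, p, nw)) {
            sum += t;
            max = Math.max(max, t);
        }
        return sum - max;
    }

    public static void main(String[] args) {
        // Hypothetical counts: one term against three clusters of frequent terms.
        int[] freq = {8, 1, 1};        // observed co-occurrence counts
        double[] p = {0.2, 0.4, 0.4};  // expected co-occurrence probabilities
        double nw = 10.0;              // total co-occurrences of the term

        System.out.println("chi-square         = " + chiSquare(freq, p, nw));
        System.out.println("refined chi-square = " + refinedChiSquare(freq, p, nw));
    }
}
```

Subtracting the largest summand keeps a term from ranking highly only because it co-occurs with one particular cluster; a term scores well only if its co-occurrence is biased toward several clusters.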
  • Method Summary

    Static Methods
    Modifier and Type
    Method
    Description
    static NGram[]
    of(String text)
    Returns the top 10 keywords.
    static NGram[]
    of(String text, int maxNumKeywords)
    Returns a given number of top keywords.
  • Method Details

    • of

      static NGram[] of(String text)
      Returns the top 10 keywords.
      Parameters:
      text - A single document.
      Returns:
      The top 10 keywords.
    • of

      static NGram[] of(String text, int maxNumKeywords)
      Returns a given number of top keywords.
      Parameters:
      text - A single document.
      maxNumKeywords - The maximum number of keywords to return.
      Returns:
      The top keywords.