Interface Text

All Known Subinterfaces:
Document
All Known Implementing Classes:
SimpleDocument

public interface Text
A minimal interface for text: content with an optional title.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final double
    CLUSTERING_THRESHOLD
    Clustering threshold: two terms are merged into one cluster when the squared geometric average of their co-occurrence probabilities is at or above this value.
    static final double
    FREQ_TERM_RATIO
    Fraction of all distinct phrases kept as "frequent terms" for the co-occurrence analysis (top 30%).
    static final int
    MAX_NGRAM_SIZE
    Maximum n-gram length considered by the Apriori phrase extraction step.
    static final int
    MIN_NGRAM_FREQ
    Minimum n-gram frequency required for a phrase to be retained.
  • Method Summary

    Modifier and Type
    Method
    Description
    String
    content()
    Returns the text content.
    default List<NGram>
    keywords(int maxNumKeywords)
    Extracts the top maxNumKeywords keywords of the document using word co-occurrence statistical information.
    static Text
    of(String content)
    Creates a text instance without a title.
    static Text
    of(String title, String content)
    Creates a text instance.
    String
    title()
    Returns the title of the text, if there is one.
  • Field Details

    • MAX_NGRAM_SIZE

      static final int MAX_NGRAM_SIZE
      Maximum n-gram length considered by the Apriori phrase extraction step.
    • MIN_NGRAM_FREQ

      static final int MIN_NGRAM_FREQ
      Minimum n-gram frequency required for a phrase to be retained.
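The interplay of the two limits can be illustrated with a minimal, hedged sketch. It is not the library's implementation: `NGramSketch` and `frequentNGrams` are hypothetical names, the constant values 4 and 4 are taken from the `keywords(int)` description, and stemming and stop-word removal are left out. An n-gram is counted only if both of its (n-1)-gram sub-phrases were already frequent, which is the Apriori-style pruning idea.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NGramSketch {
    // Local stand-ins for the Text constants; values assumed from the keywords(int) description.
    static final int MAX_NGRAM_SIZE = 4;
    static final int MIN_NGRAM_FREQ = 4;

    /** Returns the frequent n-grams (tokens joined by a space) with their counts. */
    static Map<String, Integer> frequentNGrams(List<String> tokens) {
        Map<String, Integer> result = new HashMap<>();
        Set<String> previous = new HashSet<>(); // frequent (n-1)-grams from the last pass
        for (int n = 1; n <= MAX_NGRAM_SIZE; n++) {
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i + n <= tokens.size(); i++) {
                if (n > 1) {
                    // Apriori-style pruning: both sub-phrases must already be frequent.
                    String prefix = String.join(" ", tokens.subList(i, i + n - 1));
                    String suffix = String.join(" ", tokens.subList(i + 1, i + n));
                    if (!previous.contains(prefix) || !previous.contains(suffix)) continue;
                }
                counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
            }
            Set<String> frequent = new HashSet<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() >= MIN_NGRAM_FREQ) {
                    frequent.add(e.getKey());
                    result.put(e.getKey(), e.getValue());
                }
            }
            if (frequent.isEmpty()) break; // no longer phrase can be frequent
            previous = frequent;
        }
        return result;
    }
}
```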
    • FREQ_TERM_RATIO

      static final double FREQ_TERM_RATIO
      Fraction of all distinct phrases kept as "frequent terms" for the co-occurrence analysis (top 30%).
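As a rough illustration of how such a ratio might be applied, the sketch below keeps the most frequent 30% of distinct phrases. `FrequentTermSketch` and `topTerms` are hypothetical names, the value 0.3 is assumed from the `keywords(int)` description, and rounding up with `Math.ceil` and breaking ties alphabetically are assumptions of this sketch rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class FrequentTermSketch {
    // Local stand-in for Text.FREQ_TERM_RATIO; value assumed from the keywords(int) description.
    static final double FREQ_TERM_RATIO = 0.3;

    /** Returns the most frequent phrases, keeping FREQ_TERM_RATIO of the distinct phrases. */
    static List<String> topTerms(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Rounding up is an assumption of this sketch.
        int keep = (int) Math.ceil(FREQ_TERM_RATIO * entries.size());
        List<String> top = new ArrayList<>();
        for (int i = 0; i < keep; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```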
    • CLUSTERING_THRESHOLD

      static final double CLUSTERING_THRESHOLD
      Clustering threshold: two terms are merged into one cluster when the squared geometric average of their co-occurrence probabilities is at or above this value.
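One possible reading of this criterion, offered as an assumption and not as the library's code: the geometric average of the two conditional probabilities P(a|b) and P(b|a) is sqrt(P(a|b) * P(b|a)), so its square is simply their product. The sketch below applies that reading; `ClusterCriterionSketch` and its methods are hypothetical names, and the value 0.25 is assumed from the `keywords(int)` description.

```java
class ClusterCriterionSketch {
    // Local stand-in for Text.CLUSTERING_THRESHOLD; value assumed from the keywords(int) description.
    static final double CLUSTERING_THRESHOLD = 0.25;

    /**
     * Squared geometric mean of P(a|b) and P(b|a), i.e. their product.
     * cooc is the number of co-occurrences; freqA and freqB are the term frequencies.
     */
    static double squaredGeometricMean(int cooc, int freqA, int freqB) {
        double pAGivenB = (double) cooc / freqB;
        double pBGivenA = (double) cooc / freqA;
        return pAGivenB * pBGivenA;
    }

    /** True when the two terms would be merged under the "at or above" rule. */
    static boolean sameCluster(int cooc, int freqA, int freqB) {
        return squaredGeometricMean(cooc, freqA, freqB) >= CLUSTERING_THRESHOLD;
    }
}
```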
  • Method Details

    • title

      String title()
      Returns the title of the text, if there is one.
      Returns:
      the title of the text, if there is one.
    • content

      String content()
      Returns the text content.
      Returns:
      the text content.
    • of

      static Text of(String content)
      Creates a text instance without a title.
      Parameters:
      content - the text content.
      Returns:
      a text instance with an empty title.
    • of

      static Text of(String title, String content)
      Creates a text instance.
      Parameters:
      title - the text title.
      content - the text content.
      Returns:
      a text instance.
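The contract of the two factory methods can be pictured with a small stand-in. `TextSketch` below is a hypothetical interface written only for illustration; it mirrors the documented signatures and the documented empty title of `of(String content)`, and says nothing about how the real `Text` instances are implemented.

```java
interface TextSketch {
    String title();

    String content();

    /** Mirrors of(String content): the title is empty. */
    static TextSketch of(String content) {
        return of("", content);
    }

    /** Mirrors of(String title, String content). */
    static TextSketch of(String title, String content) {
        return new TextSketch() {
            @Override
            public String title() {
                return title;
            }

            @Override
            public String content() {
                return content;
            }
        };
    }
}
```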
    • keywords

      default List<NGram> keywords(int maxNumKeywords)
      Extracts the top maxNumKeywords keywords of the document using word co-occurrence statistical information. Keywords or keyphrases capture the primary topics discussed in the text. The algorithm was proposed by Matsuo & Ishizuka (2004): Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information, AAAI 2004. It consists of six steps:
      1. Stem words with the Porter algorithm and extract phrases with an Apriori-like algorithm (up to 4 words, occurring at least 4 times; see MAX_NGRAM_SIZE and MIN_NGRAM_FREQ). Discard stop words.
      2. Select the most frequent terms (up to 30% of running terms; see FREQ_TERM_RATIO).
      3. Cluster frequent terms. Two terms are placed in the same cluster when the squared geometric average of their co-occurrence probabilities is at or above 0.25 (see CLUSTERING_THRESHOLD).
      4. Calculate the expected co-occurrence probability per cluster.
      5. Calculate the refined χ² score for each term.
      6. Return the top-maxNumKeywords terms by χ² score, suppressing sub-phrase duplicates within the same cluster.
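Steps 4 and 5 can be sketched as follows, based on the refined χ² measure described in the cited paper: for each cluster, the observed co-occurrence count is compared with the expected count, and the single largest cluster contribution is subtracted from the sum. This is an illustrative sketch under stated assumptions, not the library's code; `ChiSquareSketch`, `refinedChiSquare`, and the parameter names are hypothetical.

```java
class ChiSquareSketch {
    /**
     * Refined chi-square score of one term.
     *
     * @param observed     co-occurrence count of the term with each cluster
     * @param expectedProb expected co-occurrence probability of each cluster
     * @param nw           total number of terms in the sentences containing the term
     */
    static double refinedChiSquare(double[] observed, double[] expectedProb, double nw) {
        double sum = 0.0;
        double max = 0.0;
        for (int c = 0; c < observed.length; c++) {
            double expected = nw * expectedProb[c];
            if (expected <= 0.0) continue; // skip clusters with no expected co-occurrence
            double diff = observed[c] - expected;
            double term = diff * diff / expected;
            sum += term;
            max = Math.max(max, term);
        }
        // Subtracting the largest contribution keeps a term from scoring high
        // merely because it co-occurs with one particular cluster.
        return sum - max;
    }
}
```

Terms would then be ranked by this score in descending order, with the top maxNumKeywords returned.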
      Parameters:
      maxNumKeywords - the maximum number of keywords to return; must be positive.
      Returns:
      the top keywords; fewer than maxNumKeywords may be returned if the document is too short.
      Throws:
      IllegalArgumentException - if the text content is null or blank, or if maxNumKeywords is not positive.