Interface Text
- All Known Subinterfaces:
Document
- All Known Implementing Classes:
SimpleDocument
public interface Text
A minimal interface of text.
-
Field Summary
Fields
Modifier and Type | Field | Description
static final double | CLUSTERING_THRESHOLD | Clustering threshold: two terms are merged into one cluster when their squared geometric average co-occurrence probability is at or above this value.
static final double | FREQ_TERM_RATIO | Fraction of all distinct phrases kept as "frequent terms" for the co-occurrence analysis (top 30%).
static final int | MAX_NGRAM_SIZE | Maximum n-gram length considered by the Apriori phrase extraction step.
static final int | MIN_NGRAM_FREQ | Minimum n-gram frequency required for a phrase to be retained.
-
Method Summary
Modifier and Type | Method | Description
 | content() | Returns the text content.
 | keywords(int maxNumKeywords) | Extracts the top maxNumKeywords keywords of the document using word co-occurrence statistical information.
static Text | of | Creates a text instance without title.
static Text | of | Creates a text instance.
String | title() | Returns the title of text, if there is one.
-
Field Details
-
MAX_NGRAM_SIZE
static final int MAX_NGRAM_SIZE
Maximum n-gram length considered by the Apriori phrase extraction step.
-
MIN_NGRAM_FREQ
static final int MIN_NGRAM_FREQ
Minimum n-gram frequency required for a phrase to be retained.
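The following is a minimal sketch of how a maximum n-gram length and a minimum frequency can work together in an Apriori-like phrase extraction. It is not the library's implementation: the class NgramSketch, the method frequentNgrams, and the flat word list input (no sentence boundaries, stemming, or stop-word handling) are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the documented API.
class NgramSketch {
    /**
     * Counts n-grams level by level, up to maxSize words. An n-gram is only
     * counted when both its (n-1)-word prefix and suffix were frequent at the
     * previous level (the Apriori pruning rule). N-grams seen fewer than
     * minFreq times are dropped.
     */
    static Map<String, Integer> frequentNgrams(List<String> words, int maxSize, int minFreq) {
        Map<String, Integer> result = new HashMap<>();
        Map<String, Integer> previous = null;
        for (int n = 1; n <= maxSize; n++) {
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i + n <= words.size(); i++) {
                if (n > 1) {
                    String prefix = String.join(" ", words.subList(i, i + n - 1));
                    String suffix = String.join(" ", words.subList(i + 1, i + n));
                    if (!previous.containsKey(prefix) || !previous.containsKey(suffix)) {
                        continue; // cannot be frequent if a sub-phrase is not
                    }
                }
                counts.merge(String.join(" ", words.subList(i, i + n)), 1, Integer::sum);
            }
            counts.values().removeIf(c -> c < minFreq);
            if (counts.isEmpty()) {
                break; // no frequent n-grams at this length, so none at longer lengths
            }
            result.putAll(counts);
            previous = counts;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "b", "a", "b", "c", "a", "b");
        // Keeps "a", "b" and "a b" (each seen 3 times); "c" and "b a" are too rare.
        System.out.println(frequentNgrams(words, 2, 2));
    }
}
```

The level-wise pruning is what makes the approach Apriori-like: a longer phrase is only considered once all of its shorter sub-phrases have already met the frequency requirement.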
-
FREQ_TERM_RATIO
static final double FREQ_TERM_RATIO
Fraction of all distinct phrases kept as "frequent terms" for the co-occurrence analysis (top 30%).
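As an illustration of what keeping a fixed fraction of the distinct phrases can look like, here is a small sketch. The class FrequentTermSketch and the method topFraction are hypothetical, and both the rounding (upward) and the tie-breaking (alphabetical) are assumptions, since this page does not specify them.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the documented API.
class FrequentTermSketch {
    /** Returns the most frequent phrases, keeping the given fraction of all distinct phrases. */
    static List<String> topFraction(Map<String, Integer> freq, double ratio) {
        // Assumption: round up so that a non-empty input keeps at least one phrase.
        int keep = (int) Math.ceil(freq.size() * ratio);
        return freq.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Integer.compare(b.getValue(), a.getValue()); // descending count
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(keep)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = Map.of("keyword", 9, "extraction", 7, "document", 4, "term", 2);
        System.out.println(topFraction(freq, 0.5)); // prints [keyword, extraction]
    }
}
```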
-
CLUSTERING_THRESHOLD
static final double CLUSTERING_THRESHOLD
Clustering threshold: two terms are merged into one cluster when their squared geometric average co-occurrence probability is at or above this value.
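One possible reading of the criterion, sketched below, takes the geometric average of the two conditional co-occurrence probabilities P(a|b) and P(b|a) and squares it, which reduces to their product. This interpretation, the class ClusterSketch, its method names, and the transitive merging through a union-find structure are all assumptions for illustration. The actual computation in the library may differ.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the documented API.
class ClusterSketch {
    /** Squared geometric average of P(a|b) and P(b|a), i.e. their product. */
    static double similarity(int coocAB, int countA, int countB) {
        if (countA == 0 || countB == 0) {
            return 0.0;
        }
        double pAGivenB = (double) coocAB / countB;
        double pBGivenA = (double) coocAB / countA;
        return pAGivenB * pBGivenA; // (sqrt(p * q))^2 == p * q
    }

    /** Returns one cluster label per term; terms with equal labels share a cluster. */
    static int[] cluster(int[][] cooc, int[] count, double threshold) {
        int n = count.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                // "At or above" the threshold, as stated in the field description.
                if (similarity(cooc[i][j], count[i], count[j]) >= threshold) {
                    parent[find(parent, j)] = find(parent, i);
                }
            }
        }
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) {
            labels[i] = find(parent, i);
        }
        return labels;
    }

    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    public static void main(String[] args) {
        int[] count = {4, 4, 10};
        int[][] cooc = {{0, 2, 1}, {2, 0, 1}, {1, 1, 0}};
        // Terms 0 and 1 score exactly 0.25 and are merged; term 2 stays alone.
        System.out.println(Arrays.toString(cluster(cooc, count, 0.25))); // prints [0, 0, 2]
    }
}
```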
-
-
Method Details
-
title
String title()
Returns the title of the text, if there is one.
- Returns:
- the title of the text, if there is one.
-
content
-
of
-
of
-
keywords
Extracts the top maxNumKeywords keywords of the document using word co-occurrence statistical information. Keywords or keyphrases capture the primary topics discussed in the text. The algorithm was proposed by Matsuo & Ishizuka (2004): Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information, AAAI 2004. It consists of six steps:
- Stem words with the Porter algorithm and extract phrases with an Apriori-like algorithm (up to 4 words, with frequency at least 4). Discard stop words.
- Select the most frequent terms (up to 30% of running terms).
- Cluster frequent terms. Two terms are placed in the same cluster when the squared geometric average of their co-occurrence probability is at or above 0.25.
- Calculate the expected co-occurrence probability per cluster.
- Calculate the refined χ² score for each term.
- Return the top maxNumKeywords terms by χ² score, suppressing sub-phrase duplicates within the same cluster.
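To make the scoring step more concrete, the sketch below computes a refined χ² value in the way the cited paper is usually summarized: sum the χ² contributions of a term against every cluster, then subtract the single largest contribution so that one dominant cluster cannot inflate the score on its own. The class ChiSquareSketch, the method refinedChiSquare, and the array-based inputs are assumptions for illustration and are not taken from the library.

```java
// Hypothetical helper, not part of the documented API.
class ChiSquareSketch {
    /**
     * @param cooc cooc[c] is the observed co-occurrence count of the term with cluster c
     * @param nw   total number of co-occurrences involving the term
     * @param pc   pc[c] is the expected co-occurrence probability of cluster c
     * @return the chi-square sum over all clusters minus its largest single contribution
     */
    static double refinedChiSquare(double[] cooc, double nw, double[] pc) {
        double sum = 0.0;
        double max = 0.0;
        for (int c = 0; c < cooc.length; c++) {
            double expected = nw * pc[c];
            if (expected <= 0.0) {
                continue; // skip clusters with no expected mass to avoid division by zero
            }
            double diff = cooc[c] - expected;
            double contribution = diff * diff / expected;
            sum += contribution;
            max = Math.max(max, contribution);
        }
        return sum - max;
    }

    public static void main(String[] args) {
        double[] cooc = {6, 3, 1};
        double[] pc = {0.2, 0.3, 0.5};
        // Contributions are 8.0, 0.0 and 3.2; dropping the largest leaves about 3.2.
        System.out.println(refinedChiSquare(cooc, 10, pc));
    }
}
```

Terms whose co-occurrence pattern deviates from the expected distribution across several clusters keep a high score, while terms that are merely tied to one cluster are penalized.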
- Parameters:
maxNumKeywords - the maximum number of keywords to return; must be positive.
- Returns:
- the top keywords, possibly fewer if the document is too short.
- Throws:
IllegalArgumentException - if text is null or blank, or if maxNumKeywords is not positive.
-