Class TaxonomicDistance
java.lang.Object
smile.nlp.taxonomy.TaxonomicDistance
- All Implemented Interfaces:
Serializable, ToDoubleBiFunction<Concept,Concept>, Distance<Concept>
The distance and semantic similarity between concepts in a taxonomy.
Distance
The edge-counting distance between two concepts a and b is the
number of edges on the shortest path through their lowest common ancestor (LCA):
d(a, b) = depth(a) + depth(b) − 2 × depth(LCA(a, b))
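To make the formula concrete, here is a minimal sketch on a toy taxonomy stored as a child-to-parent map. The class EdgeCountingSketch and its methods are hypothetical illustrations, not part of the Smile API. The sketch assumes a single-rooted tree and counts depth in edges from the root, so the root has depth 0.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy taxonomy stored as a child -> parent map (hypothetical, not the Smile API). */
final class EdgeCountingSketch {
    private final Map<String, String> parent = new HashMap<>();

    /** Registers a child under a parent; the root is the node that never appears as a child. */
    void add(String parentName, String childName) {
        parent.put(childName, parentName);
    }

    /** Number of edges from the node up to the root. */
    int depth(String node) {
        int d = 0;
        for (String p = parent.get(node); p != null; p = parent.get(p)) {
            d++;
        }
        return d;
    }

    /** Lowest common ancestor: lift the deeper node first, then climb both in lock step. */
    String lca(String a, String b) {
        int da = depth(a);
        int db = depth(b);
        while (da > db) { a = parent.get(a); da--; }
        while (db > da) { b = parent.get(b); db--; }
        while (!a.equals(b)) { a = parent.get(a); b = parent.get(b); }
        return a;
    }

    /** d(a, b) = depth(a) + depth(b) - 2 * depth(LCA(a, b)). */
    int distance(String a, String b) {
        return depth(a) + depth(b) - 2 * depth(lca(a, b));
    }

    public static void main(String[] args) {
        EdgeCountingSketch t = new EdgeCountingSketch();
        t.add("animal", "mammal");
        t.add("animal", "bird");
        t.add("mammal", "dog");
        t.add("mammal", "cat");
        System.out.println(t.distance("dog", "cat"));  // 2: dog -> mammal -> cat
        System.out.println(t.distance("dog", "bird")); // 3: dog -> mammal -> animal -> bird
    }
}
```

The path from dog to bird passes through the root, so the full depths of both concepts count toward the distance.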
Semantic Similarity
Three widely used similarity measures from the computational linguistics literature are provided, all returning values in [0, 1], where 1 means identical.
- Wu-Palmer (wup)
- Based on the depth of the LCA relative to the depths of the two concepts:
sim(a,b) = 2 × depth(LCA) / (depth(a) + depth(b))
- Leacock-Chodorow (lch)
- Combines edge-counting distance with the overall depth of the taxonomy:
sim(a,b) = −log(d(a,b) / (2 × H))
where H is the height of the taxonomy. The raw value is in (0, log(2H)]; it is normalized to [0, 1] by dividing by log(2H).
- Lin
- An information-content-based measure. When no external corpus is
available, depth in the taxonomy serves as a proxy for information
content:
IC(c) = −log((depth(c) + 1) / (H + 1))
where H is the tree height.
sim(a,b) = 2 × IC(LCA) / (IC(a) + IC(b))
Returns 1 when a == b.
References
- Z. Wu and M. Palmer. Verb semantics and lexical selection. ACL, 1994.
- C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 1998.
- D. Lin. An information-theoretic definition of similarity. ICML, 1998.
Constructor Summary
Constructors -
Method Summary
Modifier and Type | Method | Description
double | d(String x, String y) | Computes the edge-counting distance between two concepts identified by their keywords.
double | d(Concept x, Concept y) | Computes the edge-counting distance between two concepts.
double | leacockChodorow(String x, String y) | Computes the Leacock-Chodorow semantic similarity between two concepts, normalized to [0, 1].
double | leacockChodorow(Concept x, Concept y) | Computes the Leacock-Chodorow semantic similarity between two concept nodes.
double | lin(String x, String y) | Computes the Lin semantic similarity between two concepts using depth as a proxy for information content.
double | lin(Concept x, Concept y) | Computes the Lin semantic similarity between two concept nodes.
double | normalizedDistance(String x, String y) | Returns the normalized edge-counting distance in [0, 1].
String | toString() |
double | wuPalmer(String x, String y) | Computes the Wu-Palmer semantic similarity between two concepts.
double | wuPalmer(Concept x, Concept y) | Computes the Wu-Palmer semantic similarity between two concept nodes.
Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface Distance
apply, applyAsDouble, pdist, pdist
-
Constructor Details
-
TaxonomicDistance
Constructor.
- Parameters:
taxonomy - the taxonomy that this distance is associated with.
-
-
Method Details
-
toString
-
d
Computes the edge-counting distance between two concepts identified by their keywords.
- Parameters:
x - a concept keyword.
y - the other concept keyword.
- Returns:
- the edge-counting distance.
- Throws:
IllegalArgumentException - if either keyword is not in the taxonomy.
-
d
-
normalizedDistance
Returns the normalized edge-counting distance in [0, 1]. The raw distance is divided by the diameter of the taxonomy, that is, the maximum possible distance between any two concepts (2 × height). Returns 0 when the two concepts are identical and 1 when they are maximally far apart.
- Parameters:
x - a concept keyword.
y - the other concept keyword.
- Returns:
- the normalized distance in [0, 1].
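As a small illustration of the normalization, the sketch below divides an edge count by the diameter 2 × H. NormalizedDistanceSketch is a hypothetical helper rather than the library's implementation, and returning 0 for a single-node taxonomy (height 0) is an assumption made here to avoid dividing by zero.

```java
/** Hypothetical helper illustrating distance / (2 * height); not the Smile API. */
final class NormalizedDistanceSketch {
    /** Divides an edge count by the taxonomy diameter 2 * height. */
    static double normalized(int distance, int height) {
        // Assumption: a taxonomy of height 0 holds a single concept, so every pair is identical.
        if (height == 0) {
            return 0.0;
        }
        // The 2.0 literal forces floating-point division; integer division would truncate.
        return distance / (2.0 * height);
    }

    public static void main(String[] args) {
        System.out.println(normalized(0, 4)); // 0.0: identical concepts
        System.out.println(normalized(3, 4)); // 0.375
        System.out.println(normalized(8, 4)); // 1.0: two deepest leaves joined only at the root
    }
}
```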
-
wuPalmer
Computes the Wu-Palmer semantic similarity between two concepts.
sim(a,b) = 2 × depth(LCA) / (depth(a) + depth(b))
Returns 1 when the two concepts are the same, and approaches 0 as they become more distantly related.
- Parameters:
x - a concept keyword.
y - the other concept keyword.
- Returns:
- the Wu-Palmer similarity in (0, 1].
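The formula can be evaluated directly once the three depths are known. The following is a minimal sketch, not the library's code: WuPalmerSketch is a hypothetical name, depths are passed in as plain integers, and treating two concepts at depth 0 as identical is an assumption that guards against a zero denominator.

```java
/** Hypothetical helper evaluating the Wu-Palmer formula from depths; not the Smile API. */
final class WuPalmerSketch {
    /** sim = 2 * depth(LCA) / (depth(a) + depth(b)). */
    static double wuPalmer(int depthA, int depthB, int depthLca) {
        int sum = depthA + depthB;
        // Assumption: a zero sum means both concepts sit at the root, so they are the same concept.
        if (sum == 0) {
            return 1.0;
        }
        return 2.0 * depthLca / sum;
    }

    public static void main(String[] args) {
        // Same concept: the LCA is the concept itself.
        System.out.println(wuPalmer(3, 3, 3)); // 1.0
        // Siblings at depth 3 whose shared parent is at depth 2: 4/6.
        System.out.println(wuPalmer(3, 3, 2));
        // Concepts at depths 2 and 4 whose LCA is at depth 1: 2/6.
        System.out.println(wuPalmer(2, 4, 1));
    }
}
```

A deeper LCA raises the score, which is why siblings deep in the taxonomy rate as more similar than siblings near the root.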
-
wuPalmer
-
leacockChodorow
Computes the Leacock-Chodorow semantic similarity between two concepts, normalized to [0, 1].
raw = −log(d(a,b) / (2 × H))
norm = raw / log(2 × H) ∈ [0, 1]
where H is the height of the taxonomy. Returns 1 when the two concepts are identical.
- Parameters:
x - a concept keyword.
y - the other concept keyword.
- Returns:
- the Leacock-Chodorow similarity in [0, 1].
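The two-step computation (raw score, then normalization) can be sketched as follows. LeacockChodorowSketch is a hypothetical helper, not the library's implementation. It takes the edge-counting distance and the height as inputs, assumes a height of at least 1 for distinct concepts, and short-circuits d = 0 to 1 to match the documented result for identical concepts while avoiding log(0).

```java
/** Hypothetical helper for the normalized Leacock-Chodorow score; not the Smile API. */
final class LeacockChodorowSketch {
    /** norm = -log(d / (2 * H)) / log(2 * H), for 1 <= d <= 2 * H. */
    static double normalized(int d, int height) {
        // Assumption: identical concepts (d == 0) score 1, as documented; this also avoids log(0).
        if (d == 0) {
            return 1.0;
        }
        double twoH = 2.0 * height;
        double raw = -Math.log(d / twoH);
        return raw / Math.log(twoH);
    }

    public static void main(String[] args) {
        int height = 4;
        System.out.println(normalized(1, height)); // directly linked concepts: highest raw score
        System.out.println(normalized(2, height)); // log(4) / log(8), about two thirds
        System.out.println(normalized(8, height)); // d equals the diameter: raw score is zero
    }
}
```

Because the normalized score equals 1 − log(d) / log(2H), it falls off logarithmically as the path length grows toward the diameter.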
-
leacockChodorow
-
lin
Computes the Lin semantic similarity between two concepts using depth as a proxy for information content.
Information content:
IC(c) = −log((depth(c) + 1) / (H + 1))
sim(a,b) = 2 × IC(LCA) / (IC(a) + IC(b))
Returns 1 when the two concepts are the same, and 0 when IC(a) + IC(b) == 0 (both at the root with H == 0).
- Parameters:
x - a concept keyword.
y - the other concept keyword.
- Returns:
- the Lin similarity in [0, 1].
-
lin
-