Class TaxonomicDistance

java.lang.Object
smile.nlp.taxonomy.TaxonomicDistance
All Implemented Interfaces:
Serializable, ToDoubleBiFunction<Concept,Concept>, Distance<Concept>

public class TaxonomicDistance extends Object implements Distance<Concept>
The distance and semantic similarity between concepts in a taxonomy.

Distance

The edge-counting distance between two concepts a and b is the number of edges on the shortest path through their lowest common ancestor (LCA):
    d(a, b) = depth(a) + depth(b) − 2 × depth(LCA(a, b))
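
To make the formula concrete, here is a minimal stand-alone sketch in plain Java. It does not use the Smile classes: the EdgeCountingSketch class, its child-to-parent map, and the sample animal taxonomy are all hypothetical, and the root is assumed to have depth 0.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Smile implementation.
final class EdgeCountingSketch {
    // Maps each node to its parent; the root maps to null.
    private final Map<String, String> parent = new HashMap<>();

    void add(String child, String parentName) {
        parent.put(child, parentName);
    }

    // Number of edges from the node up to the root (root has depth 0).
    int depth(String node) {
        int depth = 0;
        for (String p = parent.get(node); p != null; p = parent.get(p)) {
            depth++;
        }
        return depth;
    }

    // Lowest common ancestor: lift the deeper node, then climb in lockstep.
    String lca(String a, String b) {
        int da = depth(a);
        int db = depth(b);
        while (da > db) { a = parent.get(a); da--; }
        while (db > da) { b = parent.get(b); db--; }
        while (!a.equals(b)) { a = parent.get(a); b = parent.get(b); }
        return a;
    }

    // d(a, b) = depth(a) + depth(b) - 2 * depth(LCA(a, b))
    int distance(String a, String b) {
        return depth(a) + depth(b) - 2 * depth(lca(a, b));
    }

    // A small sample tree: animal -> {mammal -> {dog, cat}, bird}.
    static EdgeCountingSketch example() {
        EdgeCountingSketch t = new EdgeCountingSketch();
        t.add("animal", null);
        t.add("mammal", "animal");
        t.add("bird", "animal");
        t.add("dog", "mammal");
        t.add("cat", "mammal");
        return t;
    }

    public static void main(String[] args) {
        EdgeCountingSketch t = example();
        System.out.println(t.distance("dog", "cat"));  // prints 2
        System.out.println(t.distance("dog", "bird")); // prints 3
    }
}
```

On this sample tree, dog and cat share the parent mammal, so d = 2 + 2 − 2 × 1 = 2; dog and bird meet only at the root, so d = 2 + 1 − 2 × 0 = 3.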

Semantic Similarity

Three widely used similarity measures from the computational linguistics literature are provided, all returning values in [0, 1], where 1 means identical.
Wu-Palmer (wup)
Based on the depth of the LCA relative to the depths of the two concepts:
sim(a,b) = 2 × depth(LCA) / (depth(a) + depth(b))
Leacock-Chodorow (lch)
Combines edge-counting distance with the overall depth of the taxonomy:
sim(a,b) = −log(d(a,b) / (2 × H))
where H is the height of the taxonomy. For distinct concepts (1 ≤ d(a,b) ≤ 2H) the raw value is in [0, log(2H)]; it is normalized to [0, 1] by dividing by log(2H).
Lin
An information-content-based measure. When no external corpus is available, depth in the taxonomy serves as a proxy for information content: IC(c) = −log((depth(c) + 1) / (H + 1)) where H is the tree height.
sim(a,b) = 2 × IC(LCA) / (IC(a) + IC(b))
Returns 1 when a == b.

References

  1. Z. Wu and M. Palmer. Verb semantics and lexical selection. ACL, 1994.
  2. C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 1998.
  3. D. Lin. An information-theoretic definition of similarity. ICML, 1998.
  • Constructor Details

    • TaxonomicDistance

      public TaxonomicDistance(Taxonomy taxonomy)
      Creates a distance measure over the given taxonomy.
      Parameters:
      taxonomy - the taxonomy that this distance is associated with.
  • Method Details

    • toString

      public String toString()
      Overrides:
      toString in class Object
    • d

      public double d(String x, String y)
      Computes the edge-counting distance between two concepts identified by their keywords.
      Parameters:
      x - a concept keyword.
      y - the other concept keyword.
      Returns:
      the edge-counting distance.
      Throws:
      IllegalArgumentException - if either keyword is not in the taxonomy.
    • d

      public double d(Concept x, Concept y)
      Computes the edge-counting distance between two concepts.
      d(a,b) = depth(a) + depth(b) − 2 × depth(LCA(a,b))
      Specified by:
      d in interface Distance<Concept>
      Parameters:
      x - a concept.
      y - the other concept.
      Returns:
      the edge-counting distance.
    • normalizedDistance

      public double normalizedDistance(String x, String y)
      Returns the normalized edge-counting distance in [0, 1]. The raw distance is divided by the diameter of the taxonomy, that is, the maximum possible distance between any two concepts (2 × height). Returns 0 when the two concepts are identical and 1 when they are maximally far apart.
      Parameters:
      x - a concept keyword.
      y - the other concept keyword.
      Returns:
      the normalized distance in [0, 1].
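
As a quick worked example of the normalization, the sketch below divides a raw edge-counting distance by the diameter 2 × height. The class and method names are hypothetical, and returning 0 for a taxonomy of height 0 is an assumption of this sketch, not documented behavior.

```java
// Hypothetical illustration only; not the Smile implementation.
final class NormalizedDistanceSketch {
    // rawDistance: edge-counting distance; height: taxonomy height H.
    static double normalize(double rawDistance, int height) {
        if (height == 0) {
            return 0.0; // assumption: a single-node taxonomy has no spread
        }
        return rawDistance / (2.0 * height);
    }

    public static void main(String[] args) {
        System.out.println(normalize(2, 2)); // prints 0.5
        System.out.println(normalize(4, 2)); // prints 1.0
    }
}
```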
    • wuPalmer

      public double wuPalmer(String x, String y)
      Computes the Wu-Palmer semantic similarity between two concepts.
      sim(a,b) = 2 × depth(LCA) / (depth(a) + depth(b))
      Returns 1 when the two concepts are the same, and approaches 0 as they become more distantly related.
      Parameters:
      x - a concept keyword.
      y - the other concept keyword.
      Returns:
      the Wu-Palmer similarity in (0, 1].
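
The Wu-Palmer formula can be sketched directly from three depths. The helper below is hypothetical and takes the depths as inputs, so it works with whatever depth convention the taxonomy uses; the guard for a zero denominator (both concepts at a depth-0 root) is an assumption of this sketch.

```java
// Hypothetical illustration only; not the Smile implementation.
final class WuPalmerSketch {
    // sim(a, b) = 2 * depth(LCA) / (depth(a) + depth(b))
    static double wuPalmer(int depthA, int depthB, int depthLca) {
        if (depthA + depthB == 0) {
            return 1.0; // assumption: both concepts are the depth-0 root
        }
        return 2.0 * depthLca / (depthA + depthB);
    }

    public static void main(String[] args) {
        // Two siblings at depth 2 whose parent is at depth 1.
        System.out.println(wuPalmer(2, 2, 1)); // prints 0.5
        // A concept compared with itself: the LCA is the concept.
        System.out.println(wuPalmer(3, 3, 3)); // prints 1.0
    }
}
```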
    • wuPalmer

      public double wuPalmer(Concept x, Concept y)
      Computes the Wu-Palmer semantic similarity between two concept nodes.
      Parameters:
      x - a concept.
      y - the other concept.
      Returns:
      the Wu-Palmer similarity in (0, 1].
    • leacockChodorow

      public double leacockChodorow(String x, String y)
      Computes the Leacock-Chodorow semantic similarity between two concepts, normalized to [0, 1].
        raw  = −log(d(a,b) / (2 × H))
        norm = raw / log(2 × H)        ∈ [0, 1]
      
      where H is the height of the taxonomy. Returns 1 when the two concepts are identical.
      Parameters:
      x - a concept keyword.
      y - the other concept keyword.
      Returns:
      the Leacock-Chodorow similarity in [0, 1].
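
A worked instance of the two-step computation, as a hedged sketch: the class and method names are hypothetical, the inputs are a precomputed edge-counting distance and taxonomy height, and the early return for identical concepts mirrors the documented behavior rather than any known implementation detail. The sketch assumes height ≥ 1 whenever the two concepts differ.

```java
// Hypothetical illustration only; not the Smile implementation.
final class LeacockChodorowSketch {
    static double leacockChodorow(double distance, int height) {
        if (distance == 0) {
            return 1.0; // identical concepts
        }
        double diameter = 2.0 * height;
        // log(2H / d) is the same quantity as -log(d / (2H)).
        double raw = Math.log(diameter / distance);
        return raw / Math.log(diameter);
    }

    public static void main(String[] args) {
        System.out.println(leacockChodorow(2, 2));
        System.out.println(leacockChodorow(4, 2));
    }
}
```

With H = 2 and d = 2, raw = log 2 and norm = log 2 / log 4 = 0.5; at the maximum distance d = 2H = 4, raw = 0 and so norm = 0.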
    • leacockChodorow

      public double leacockChodorow(Concept x, Concept y)
      Computes the Leacock-Chodorow semantic similarity between two concept nodes.
      Parameters:
      x - a concept.
      y - the other concept.
      Returns:
      the Leacock-Chodorow similarity in [0, 1].
    • lin

      public double lin(String x, String y)
      Computes the Lin semantic similarity between two concepts using depth as a proxy for information content.

      Information content: IC(c) = −log((depth(c) + 1) / (H + 1)), where H is the height of the taxonomy.

      sim(a,b) = 2 × IC(LCA) / (IC(a) + IC(b))
      Returns 1 when the two concepts are the same, and 0 when IC(a) + IC(b) == 0 (both at the root with H == 0).
      Parameters:
      x - a concept keyword.
      y - the other concept keyword.
      Returns:
      the Lin similarity in [0, 1].
    • lin

      public double lin(Concept x, Concept y)
      Computes the Lin semantic similarity between two concept nodes.
      Parameters:
      x - a concept.
      y - the other concept.
      Returns:
      the Lin similarity in [0, 1].