Class EditDistance

java.lang.Object
smile.math.distance.EditDistance
All Implemented Interfaces:
Serializable, ToDoubleBiFunction<String,String>, Distance<String>, Metric<String>

public class EditDistance extends Object implements Metric<String>
The Edit distance between two strings is a metric for measuring the amount of difference between two sequences. The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. A generalization of the Levenshtein distance (Damerau-Levenshtein distance) allows the transposition of two characters as an operation.

Given two strings x and y of length m and n (suppose n >= m), this implementation takes O(ne) time and O(mn) space by an extended Ukkonen's algorithm in case of unit cost, where e is the edit distance between x and y. Thus, this algorithm is output sensitive. The smaller the distance, the faster it runs.

For weighted cost, this implements the regular dynamic programming algorithm, which takes O(mn) time and O(m) space.

See Also:
  • Constructor Summary

    Constructors
    Constructor
    Description
    Constructor.
    EditDistance(boolean damerau)
    Constructor.
    EditDistance(int maxStringLength)
    Constructor.
    EditDistance(int[][] weight)
    Constructor.
    EditDistance(int[][] weight, double radius)
    Constructor.
    EditDistance(int maxStringLength, boolean damerau)
    Constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    double
    d(char[] a, char[] b)
    Edit distance between two strings.
    double
    d(String a, String b)
    Edit distance between two strings.
    static int
    damerau(char[] a, char[] b)
    Damerau-Levenshtein distance between two strings allows insertion, deletion, substitution, or transposition of characters.
    static int
    Damerau-Levenshtein distance between two strings allows insertion, deletion, substitution, or transposition of characters.
    static int
    levenshtein(char[] a, char[] b)
    Levenshtein distance between two strings allows insertion, deletion, or substitution of characters.
    static int
    Levenshtein distance between two strings allows insertion, deletion, or substitution of characters.
     

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

    Methods inherited from interface smile.math.distance.Distance

    apply, applyAsDouble, D, D
  • Constructor Details

    • EditDistance

      public EditDistance()
      Constructor. Multi-thread safe Levenshtein distance.
    • EditDistance

      public EditDistance(boolean damerau)
      Constructor. Multi-thread safe Damerau-Levenshtein distance.
      Parameters:
      damerau - if true, calculate Damerau-Levenshtein distance instead of plain Levenshtein distance.
    • EditDistance

      public EditDistance(int maxStringLength)
      Constructor. Highly efficient Levenshtein distance but not multi-thread safe.
      Parameters:
      maxStringLength - the maximum length of strings that will be feed to this algorithm.
    • EditDistance

      public EditDistance(int maxStringLength, boolean damerau)
      Constructor. Highly efficient Damerau-Levenshtein distance but not multi-thread safe.
      Parameters:
      maxStringLength - the maximum length of strings that will be feed to this algorithm.
      damerau - if true, calculate Damerau-Levenshtein distance instead of plain Levenshtein distance.
    • EditDistance

      public EditDistance(int[][] weight)
      Constructor. Weighted Levenshtein distance without path constraints. Only insertion, deletion, and substitution operations are supported.
      Parameters:
      weight - the weight matrix.
    • EditDistance

      public EditDistance(int[][] weight, double radius)
      Constructor. Weighted Levenshtein distance with Sakoe-Chiba band, which improve computational cost. Only insertion, deletion, and substitution operations are supported.
      Parameters:
      weight - the weight matrix.
      radius - the window width of Sakoe-Chiba band in terms of percentage of sequence length.
  • Method Details

    • toString

      public String toString()
      Overrides:
      toString in class Object
    • d

      public double d(String a, String b)
      Edit distance between two strings. O(mn) time and O(n) space for weighted edit distance. O(ne) time and O(mn) space for unit cost edit distance. For weighted edit distance, this method is multi-thread safe. However, it is NOT multi-thread safe for unit cost edit distance.
      Specified by:
      d in interface Distance<String>
      Parameters:
      a - an object.
      b - an object.
      Returns:
      the distance.
    • d

      public double d(char[] a, char[] b)
      Edit distance between two strings. O(mn) time and O(n) space for weighted edit distance. O(ne) time and O(mn) space for unit cost edit distance. For weighted edit distance, this method is multi-thread safe. However, it is NOT multi-thread safe for unit cost edit distance.
      Parameters:
      a - a string.
      b - a string.
      Returns:
      the distance.
    • levenshtein

      public static int levenshtein(String a, String b)
      Levenshtein distance between two strings allows insertion, deletion, or substitution of characters. O(mn) time and O(n) space. Multi-thread safe.
      Parameters:
      a - a string.
      b - a string.
      Returns:
      the distance.
    • levenshtein

      public static int levenshtein(char[] a, char[] b)
      Levenshtein distance between two strings allows insertion, deletion, or substitution of characters. O(mn) time and O(n) space. Multi-thread safe.
      Parameters:
      a - a string.
      b - a string.
      Returns:
      the distance.
    • damerau

      public static int damerau(String a, String b)
      Damerau-Levenshtein distance between two strings allows insertion, deletion, substitution, or transposition of characters. O(mn) time and O(n) space. Multi-thread safe.
      Parameters:
      a - a string.
      b - a string.
      Returns:
      the distance.
    • damerau

      public static int damerau(char[] a, char[] b)
      Damerau-Levenshtein distance between two strings allows insertion, deletion, substitution, or transposition of characters. O(mn) time and O(n) space. Multi-thread safe.
      Parameters:
      a - a string.
      b - a string.
      Returns:
      the distance.