Class MEC<T>

All Implemented Interfaces:
Serializable, Comparable<MEC<T>>

public class MEC<T> extends PartitionClustering implements Comparable<MEC<T>>
Non-parametric Minimum Conditional Entropy Clustering. This method performs very well especially when the exact number of clusters is unknown. The method can also correctly reveal the structure of data and effectively identify outliers simultaneously.

The clustering criterion is based on the conditional entropy H(C | x), where C is the cluster label and x is an observation. According to Fano's inequality, we can estimate C with a low probability of error only if the conditional entropy H(C | X) is small. MEC also generalizes the criterion by replacing Shannon's entropy with Havrda-Charvat's structural α-entropy. Interestingly, the minimum entropy criterion based on structural α-entropy is equal to the probability error of the nearest neighbor method when α= 2. To estimate p(C | x), MEC employs Parzen density estimation, a nonparametric approach.

MEC is an iterative algorithm starting with an initial partition given by any other clustering methods, e.g. k-means, CLARNAS, hierarchical clustering, etc. Note that a random initialization is NOT appropriate.

References

  1. Haifeng Li. All rights reserved., Keshu Zhang, and Tao Jiang. Minimum Entropy Clustering and Applications to Gene Expression Analysis. CSB, 2004.
See Also:
  • Field Details

    • entropy

      public final double entropy
      The conditional entropy as the objective function.
    • radius

      public final double radius
      The range of neighborhood.
  • Constructor Details

    • MEC

      public MEC(double entropy, double radius, RNNSearch<T,T> nns, int k, int[] y)
      Constructor.
      Parameters:
      entropy - the conditional entropy of clusters.
      radius - the neighborhood radius.
      nns - the data structure for neighborhood search.
      k - the number of clusters.
      y - the cluster labels.
  • Method Details

    • compareTo

      public int compareTo(MEC<T> o)
      Specified by:
      compareTo in interface Comparable<T>
    • fit

      public static <T> MEC<T> fit(T[] data, Distance<T> distance, int k, double radius)
      Clustering the data.
      Type Parameters:
      T - the data type.
      Parameters:
      data - the observations.
      distance - the distance function.
      k - the number of clusters. Note that this is just a hint. The final number of clusters may be less.
      radius - the neighborhood radius.
      Returns:
      the model.
    • fit

      public static <T> MEC<T> fit(T[] data, RNNSearch<T,T> nns, int k, double radius, int[] y, double tol)
      Clustering the data.
      Type Parameters:
      T - the data type.
      Parameters:
      data - the observations.
      nns - the neighborhood search data structure.
      k - the number of clusters. Note that this is just a hint. The final number of clusters may be less.
      radius - the neighborhood radius.
      y - the initial clustering labels, which could be produced by any other clustering methods.
      tol - the tolerance of convergence test.
      Returns:
      the model.
    • predict

      public int predict(T x)
      Cluster a new instance.
      Parameters:
      x - a new instance.
      Returns:
      the cluster label. Note that it may be PartitionClustering.OUTLIER.
    • toString

      public String toString()
      Overrides:
      toString in class PartitionClustering