Class PartitionClustering

java.lang.Object
smile.clustering.PartitionClustering
All Implemented Interfaces:
Serializable
Direct Known Subclasses:
CentroidClustering, DBSCAN, DENCLUE, MEC, SpectralClustering

public abstract class PartitionClustering extends Object implements Serializable
Partition clustering. Partition methods classify the observations into distinct non-overlapping groups.
See Also:
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    final int
    The number of clusters.
    static final int
    Cluster label for outliers or noises.
    final int[]
    The number of observations in each cluster.
    final int[]
    The cluster labels of data.
  • Constructor Summary

    Constructors
    Constructor
    Description
    PartitionClustering(int k, int[] y)
    Constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    static <T extends PartitionClustering & Comparable<? super T>>
    T
    run(int runs, Supplier<T> clustering)
    Runs a clustering algorithm multiple times and return the best one (e.g.
    static <T> double[]
    seed(T[] data, T[] medoids, int[] y, ToDoubleBiFunction<T,T> distance)
    Initialize cluster membership of input objects with K-Means++ algorithm.
     

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

    • OUTLIER

      public static final int OUTLIER
      Cluster label for outliers or noises.
      See Also:
    • k

      public final int k
      The number of clusters.
    • y

      public final int[] y
      The cluster labels of data.
    • size

      public final int[] size
      The number of observations in each cluster.
  • Constructor Details

    • PartitionClustering

      public PartitionClustering(int k, int[] y)
      Constructor.
      Parameters:
      k - the number of clusters.
      y - the cluster labels.
  • Method Details

    • toString

      public String toString()
      Overrides:
      toString in class Object
    • seed

      public static <T> double[] seed(T[] data, T[] medoids, int[] y, ToDoubleBiFunction<T,T> distance)
      Initialize cluster membership of input objects with K-Means++ algorithm. Many clustering methods, e.g. k-means, need an initial clustering configuration as a seed.

      K-Means++ is based on the intuition of spreading the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its distance squared to the point's closest cluster center.

      The exact algorithm is as follows:

      1. Choose one center uniformly at random from among the data points.
      2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
      3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D2(x).
      4. Repeat Steps 2 and 3 until k centers have been chosen.
      5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
      This seeding method gives out considerable improvements in the final error of k-means. Although the initial selection in the algorithm takes extra time, the k-means part itself converges very fast after this seeding and thus the algorithm actually lowers the computation time too.
      1. D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
      2. Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.
      Type Parameters:
      T - the type of input object.
      Parameters:
      data - data objects array of size n.
      medoids - an array of size k to store cluster medoids on output.
      y - an array of size n to store cluster labels on output.
      distance - the distance function.
      Returns:
      an array of size n to store the distance of each observation to nearest medoid.
    • run

      public static <T extends PartitionClustering & Comparable<? super T>> T run(int runs, Supplier<T> clustering)
      Runs a clustering algorithm multiple times and return the best one (e.g. smallest distortion).
      Type Parameters:
      T - the data type.
      Parameters:
      runs - the number of runs.
      clustering - the clustering algorithm.
      Returns:
      the model.