Record Class CentroidClustering<T,U>
- Type Parameters:
T - the type of centroids.
U - the type of observations. Usually, T and U are the same. But in case of SIB, they are different.
- Record Components:
name - the clustering algorithm name.
centers - the cluster centroids or medoids.
distance - the distance function.
group - the cluster labels of data.
proximity - the squared distance between data points and their respective cluster centers.
size - the number of data points in each cluster.
distortions - the average squared distance of data points within each cluster.
- All Implemented Interfaces:
Serializable, Comparable<CentroidClustering<T,U>>
Variations of k-means include restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means), etc.
Most k-means-type algorithms require the number of clusters to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders of clusters (which is not surprising since the algorithm optimizes cluster centers, not cluster borders).
Constructor Summary
Constructors
- CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T, U> distance, int[] group, double[] proximity): Constructor.
- CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T, U> distance, int[] group, double[] proximity, int[] size, double[] distortions): Creates an instance of a CentroidClustering record class.
Method Summary
Methods (modifier and type, method, description)
- T center(int i): Returns the center of i-th cluster.
- T[] centers(): Returns the value of the centers record component.
- int compareTo(CentroidClustering<T,U> o)
- ToDoubleBiFunction<T, U> distance(): Returns the value of the distance record component.
- double distortion(): Returns the average squared distance between data points and their respective cluster centers.
- double[] distortions(): Returns the value of the distortions record component.
- final boolean equals(Object o): Indicates whether some other object is "equal to" this one.
- int[] group(): Returns the value of the group record component.
- int group(int i): Returns the cluster label of i-th data point.
- final int hashCode(): Returns a hash code value for this object.
- static <T> CentroidClustering<T, T> init(String name, T[] data, int k, ToDoubleBiFunction<T, T> distance): Returns a random clustering based on K-Means++ algorithm.
- int k(): Returns the number of clusters.
- String name(): Returns the value of the name record component.
- int predict(U x): Classifies a new observation.
- double[] proximity(): Returns the value of the proximity record component.
- double proximity(int i): Returns the distance of i-th data point to its cluster center.
- double radius(int i): Returns the radius of i-th cluster.
- static double[][] seeds(double[][] data, int k): Selects random samples as seeds for various algorithms.
- int[] size(): Returns the value of the size record component.
- int size(int i): Returns the size of i-th cluster.
- String toString(): Returns a string representation of this record class.
-
Constructor Details
-
CentroidClustering
public CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T, U> distance, int[] group, double[] proximity)
Constructor.
- Parameters:
name - the clustering algorithm name.
centers - the cluster centroids or medoids.
distance - the distance function.
group - the cluster labels of data.
proximity - the squared distance of each data point to its nearest cluster center.
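The shorter constructor takes neither size nor distortions, so both are presumably derived from group and proximity. The following is a minimal sketch of how that could be done from the component definitions above; it is not the library's code. The class name ClusterStats, the explicit k argument, and the choice to report 0 for an empty cluster are all assumptions made for illustration.

```java
import java.util.Arrays;

public class ClusterStats {
    /** Counts how many data points carry each cluster label in 0..k-1. */
    public static int[] sizes(int[] group, int k) {
        int[] size = new int[k];
        for (int label : group) {
            size[label]++;
        }
        return size;
    }

    /** Averages the squared distances of the points within each cluster. */
    public static double[] distortions(int[] group, double[] proximity, int k) {
        int[] size = sizes(group, k);
        double[] result = new double[k];
        for (int i = 0; i < group.length; i++) {
            result[group[i]] += proximity[i];
        }
        for (int c = 0; c < k; c++) {
            // Assumption: an empty cluster reports a distortion of 0.
            result[c] = size[c] == 0 ? 0.0 : result[c] / size[c];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] group = {0, 0, 1, 1, 1};
        double[] proximity = {1.0, 3.0, 2.0, 2.0, 5.0};
        System.out.println(Arrays.toString(sizes(group, 2)));                  // [2, 3]
        System.out.println(Arrays.toString(distortions(group, proximity, 2))); // [2.0, 3.0]
    }
}
```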
-
CentroidClustering
public CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T, U> distance, int[] group, double[] proximity, int[] size, double[] distortions)
Creates an instance of a CentroidClustering record class.
- Parameters:
name - the value for the name record component
centers - the value for the centers record component
distance - the value for the distance record component
group - the value for the group record component
proximity - the value for the proximity record component
size - the value for the size record component
distortions - the value for the distortions record component
-
-
Method Details
-
k
public int k()
Returns the number of clusters.
- Returns:
- the number of clusters.
-
distortion
public double distortion()
Returns the average squared distance between data points and their respective cluster centers. This is also known as the within-cluster sum-of-squares (WCSS).
- Returns:
- the distortion.
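The description mentions both an average and a sum of squares. The two differ only by a factor of n, the number of data points. As a rough sketch, assuming proximity holds the per-point squared distances as the record components state, the average could be computed as below. The class name DistortionSketch is hypothetical and this is not the library's implementation.

```java
public class DistortionSketch {
    /** Mean of the per-point squared distances; the plain sum would be the WCSS total. */
    public static double distortion(double[] proximity) {
        double sum = 0.0;
        for (double squaredDistance : proximity) {
            sum += squaredDistance;
        }
        return sum / proximity.length;
    }

    public static void main(String[] args) {
        // Three points at squared distances 1, 4 and 4 from their centers.
        System.out.println(distortion(new double[] {1.0, 4.0, 4.0})); // 3.0
    }
}
```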
-
compareTo
- Specified by:
compareToin interfaceComparable<T>
-
toString
-
center
Returns the center of i-th cluster.
- Parameters:
i - the index of cluster.
- Returns:
- the cluster center.
-
group
public int group(int i)
Returns the cluster label of i-th data point.
- Parameters:
i - the index of data point.
- Returns:
- the cluster label.
-
proximity
public double proximity(int i)
Returns the distance of i-th data point to its cluster center.
- Parameters:
i - the index of data point.
- Returns:
- the distance to cluster center.
-
size
public int size(int i)
Returns the size of i-th cluster.
- Parameters:
i - the index of cluster.
- Returns:
- the cluster size.
-
radius
public double radius(int i)
Returns the radius of i-th cluster.
- Parameters:
i - the index of cluster.
- Returns:
- the cluster radius.
-
predict
Classifies a new observation.
- Parameters:
x - a new observation.
- Returns:
- the cluster label.
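Since the class description says these algorithms always assign an object to the nearest centroid, classification can be pictured as a nearest-center search. The sketch below assumes that reading. The class NearestCenter and its static predict helper are hypothetical stand-ins that take the centers and distance function as arguments rather than reading them from a record.

```java
import java.util.function.ToDoubleBiFunction;

public class NearestCenter {
    /** Returns the index of the center with the smallest distance to x. */
    public static <T, U> int predict(T[] centers, ToDoubleBiFunction<T, U> distance, U x) {
        int label = -1;
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = distance.applyAsDouble(centers[i], x);
            if (d < best) {
                best = d;
                label = i;
            }
        }
        return label;
    }

    public static void main(String[] args) {
        Double[] centers = {0.0, 10.0};
        ToDoubleBiFunction<Double, Double> dist = (c, x) -> Math.abs(c - x);
        System.out.println(predict(centers, dist, 7.5)); // 1
    }
}
```

Ties go to the lower index here because the comparison is strict; that is a choice of this sketch only.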
-
init
public static <T> CentroidClustering<T,T> init(String name, T[] data, int k, ToDoubleBiFunction<T, T> distance)
Returns a random clustering based on K-Means++ algorithm. Many clustering methods, e.g. k-means, need an initial clustering configuration as a seed.
K-Means++ is based on the intuition of spreading the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its distance squared to the point's closest cluster center.
The exact algorithm is as follows:
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
- D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
- Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.
- Type Parameters:
T - the type of input object.
- Parameters:
name - the clustering algorithm name.
data - data objects array of size n.
k - the number of medoids.
distance - the distance function.
- Returns:
- the initial clustering.
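The seeding steps listed for this method (choose a first center uniformly, then repeatedly sample a new center with probability proportional to the squared distance to the nearest chosen center) can be sketched as follows. This is an illustrative implementation of the published K-Means++ procedure, not the library's code. The class KMeansPlusPlusSketch, the seedIndices helper, and the explicit Random argument are assumptions, and the sketch returns indices into data rather than a clustering.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleBiFunction;

public class KMeansPlusPlusSketch {
    /** Returns the indices of k data points chosen as initial centers. */
    public static <T> int[] seedIndices(T[] data, int k,
                                        ToDoubleBiFunction<T, T> distance, Random rng) {
        int n = data.length;
        int[] chosen = new int[k];
        // Choose the first center uniformly at random.
        chosen[0] = rng.nextInt(n);

        // d2[i] tracks D(x)^2: squared distance from point i to its nearest chosen center.
        double[] d2 = new double[n];
        Arrays.fill(d2, Double.POSITIVE_INFINITY);

        for (int c = 1; c < k; c++) {
            // Update D(x)^2 against the most recently added center.
            T last = data[chosen[c - 1]];
            double total = 0.0;
            for (int i = 0; i < n; i++) {
                double d = distance.applyAsDouble(data[i], last);
                d2[i] = Math.min(d2[i], d * d);
                total += d2[i];
            }
            // Sample an index with probability proportional to D(x)^2.
            double r = rng.nextDouble() * total;
            double acc = 0.0;
            int pick = n - 1; // fallback when every remaining weight is zero
            for (int i = 0; i < n; i++) {
                acc += d2[i];
                if (acc > r) {
                    pick = i;
                    break;
                }
            }
            chosen[c] = pick;
        }
        return chosen;
    }

    public static void main(String[] args) {
        Double[] data = {0.0, 1.0, 10.0, 11.0, 20.0};
        int[] idx = seedIndices(data, 3, (a, b) -> Math.abs(a - b), new Random(42));
        System.out.println(Arrays.toString(idx));
    }
}
```

Points already chosen as centers have weight zero, so they cannot be drawn again while any other point has a positive distance.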
-
seeds
public static double[][] seeds(double[][] data, int k)
Selects random samples as seeds for various algorithms.
- Parameters:
data - samples to select seeds from.
k - the number of seeds.
- Returns:
- the seeds.
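One plausible way to select random samples is a partial Fisher-Yates shuffle over the row indices, which picks k distinct rows. Whether the library samples with or without replacement is not stated here, so the sketch below is an assumption. The class SeedSampling and the extra Random parameter are hypothetical.

```java
import java.util.Random;

public class SeedSampling {
    /** Picks k distinct rows of data uniformly at random and returns copies of them. */
    public static double[][] seeds(double[][] data, int k, Random rng) {
        int n = data.length;
        if (k > n) {
            throw new IllegalArgumentException("k must not exceed the number of samples");
        }
        int[] index = new int[n];
        for (int i = 0; i < n; i++) {
            index[i] = i;
        }
        double[][] seeds = new double[k][];
        for (int i = 0; i < k; i++) {
            // Swap a random not-yet-used index into position i.
            int j = i + rng.nextInt(n - i);
            int tmp = index[i];
            index[i] = index[j];
            index[j] = tmp;
            // Copy the row so later changes to a seed do not alter the data.
            seeds[i] = data[index[i]].clone();
        }
        return seeds;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {1, 1}, {5, 5}, {9, 9}};
        double[][] s = seeds(data, 2, new Random(1));
        System.out.println(s.length); // 2
    }
}
```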
-
hashCode
-
equals
Indicates whether some other object is "equal to" this one. The objects are equal if the other object is of the same class and if all the record components are equal. All components in this record class are compared with Objects::equals(Object,Object).
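Because components are compared with Objects::equals, array-valued components are compared by reference, not by contents. The small demonstration below uses a hypothetical record, Labels, with one array component to show the effect. It stands in for the real record class and needs Java 16 or later.

```java
public class RecordEqualsDemo {
    // Hypothetical record with an array component, standing in for the real class.
    record Labels(String name, int[] group) {}

    /** Two records that share the very same array instance. */
    public static boolean sameArrayInstance() {
        int[] g = {0, 1, 1};
        return new Labels("a", g).equals(new Labels("a", g));
    }

    /** Two records whose arrays have equal contents but are separate objects. */
    public static boolean equalContentsDifferentArrays() {
        return new Labels("a", new int[] {0, 1, 1})
                .equals(new Labels("a", new int[] {0, 1, 1}));
    }

    public static void main(String[] args) {
        System.out.println(sameArrayInstance());            // true
        System.out.println(equalContentsDifferentArrays()); // false
    }
}
```

To compare two clusterings by content, compare the arrays explicitly, for example with java.util.Arrays.equals.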
name
-
centers
-
distance
-
group
-
proximity
-
size
-
distortions
public double[] distortions()
Returns the value of the distortions record component.
- Returns:
- the value of the distortions record component
-