smile.clustering.CentroidClustering<T,U>

Type Parameters:

T - the type of centroids.

U - the type of observations. Usually T and U are the same, but in the case of SIB they are different.

Record Components:

name - the clustering algorithm name.

centers - the cluster centroids or medoids.

distance - the distance function.

group - the cluster labels of data.

proximity - the squared distance between data points and their respective cluster centers.

size - the number of data points in each cluster.

distortions - the average squared distance of data points within each cluster.

All Implemented Interfaces: Serializable, Comparable<CentroidClustering<T,U>>

public record CentroidClustering<T,U>(String name, T[] centers, ToDoubleBiFunction<T,U> distance, int[] group, double[] proximity, int[] size, double[] distortions) extends Record implements Comparable<CentroidClustering<T,U>>, Serializable

Centroid-based clustering that uses the center of each cluster to group similar data points into clusters. A cluster center is not necessarily a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign each object to its nearest cluster center, such that the sum of squared distances from the objects to their assigned cluster centers is minimized.
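To make the optimization problem concrete, here is a minimal sketch of the assignment step and the objective it evaluates. The class and method names (NearestCenterAssignment, assign, objective) are hypothetical and not part of Smile, and squared Euclidean distance on double[] points is an assumption made for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Smile: the assignment step of k-means. */
final class NearestCenterAssignment {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return sum;
    }

    /** Returns, for each data point, the index of its nearest center. */
    static int[] assign(double[][] data, double[][] centers) {
        int[] group = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centers.length; c++) {
                double d = squaredDistance(data[i], centers[c]);
                if (d < best) {
                    best = d;
                    group[i] = c;
                }
            }
        }
        return group;
    }

    /** The objective: total squared distance of points to their assigned centers. */
    static double objective(double[][] data, double[][] centers, int[] group) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            total += squaredDistance(data[i], centers[group[i]]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centers = {{0, 0.5}, {10, 10.5}};
        int[] group = assign(data, centers);
        System.out.println(Arrays.toString(group));
        System.out.println(objective(data, centers, group));
    }
}
```

A full k-means iteration would alternate this assignment step with recomputing each center from its assigned points; only the assignment half is shown here.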

Variations of k-means include restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means), etc.

Most k-means-type algorithms require the number of clusters to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders of clusters (which is not surprising since the algorithm optimizes cluster centers, not cluster borders).

Constructor Summary

| Constructor | Description |
| --- | --- |
| CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T,U> distance, int[] group, double[] proximity) | Constructor. |
| CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T,U> distance, int[] group, double[] proximity, int[] size, double[] distortions) | Creates an instance of a CentroidClustering record class. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| T | center(int i) | Returns the center of i-th cluster. |
| T[] | centers() | Returns the value of the centers record component. |
| int | compareTo(CentroidClustering<T,U> o) | |
| ToDoubleBiFunction<T,U> | distance() | Returns the value of the distance record component. |
| double | distortion() | Returns the average squared distance between data points and their respective cluster centers. |
| double[] | distortions() | Returns the value of the distortions record component. |
| final boolean | equals(Object o) | Indicates whether some other object is "equal to" this one. |
| int[] | group() | Returns the value of the group record component. |
| int | group(int i) | Returns the cluster label of i-th data point. |
| final int | hashCode() | Returns a hash code value for this object. |
| static <T> CentroidClustering<T,T> | init(String name, T[] data, int k, ToDoubleBiFunction<T,T> distance) | Returns a random clustering based on K-Means++ algorithm. |
| int | k() | Returns the number of clusters. |
| String | name() | Returns the value of the name record component. |
| int | predict(U x) | Classifies a new observation. |
| double[] | proximity() | Returns the value of the proximity record component. |
| double | proximity(int i) | Returns the distance of i-th data point to its cluster center. |
| double | radius(int i) | Returns the radius of i-th cluster. |
| static double[][] | seeds(double[][] data, int k) | Selects random samples as seeds for various algorithms. |
| int[] | size() | Returns the value of the size record component. |
| int | size(int i) | Returns the size of i-th cluster. |
| String | toString() | Returns a string representation of this record class. |

Methods inherited from class Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait

Constructor Details
- CentroidClustering
  
  public CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T,U> distance, int[] group, double[] proximity)
  
  Constructor.
  
  Parameters:
  
  name - the clustering algorithm name.
  
  centers - the cluster centroids or medoids.
  
  distance - the distance function.
  
  group - the cluster labels of data.
  
  proximity - the squared distance of each data point to its nearest cluster center.
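This constructor takes no size or distortions arguments, which suggests that those components can be derived from group and proximity. The page does not say how, so the following is only a sketch of one plausible derivation. ClusterSummary and its methods are hypothetical names, not Smile API.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Smile: per-cluster summaries from labels. */
final class ClusterSummary {
    /** Counts how many data points carry each cluster label in [0, k). */
    static int[] sizes(int[] group, int k) {
        int[] size = new int[k];
        for (int label : group) {
            size[label]++;
        }
        return size;
    }

    /** Averages the per-point proximity values within each cluster. */
    static double[] averageProximity(int[] group, double[] proximity, int k) {
        int[] size = sizes(group, k);
        double[] average = new double[k];
        for (int i = 0; i < group.length; i++) {
            average[group[i]] += proximity[i];
        }
        for (int c = 0; c < k; c++) {
            if (size[c] > 0) {
                average[c] /= size[c]; // empty clusters keep an average of 0
            }
        }
        return average;
    }

    public static void main(String[] args) {
        int[] group = {0, 0, 1};
        double[] proximity = {1.0, 3.0, 4.0};
        System.out.println(Arrays.toString(sizes(group, 2)));
        System.out.println(Arrays.toString(averageProximity(group, proximity, 2)));
    }
}
```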
- CentroidClustering
  
  public CentroidClustering(String name, T[] centers, ToDoubleBiFunction<T,U> distance, int[] group, double[] proximity, int[] size, double[] distortions)
  
  Creates an instance of a CentroidClustering record class.
  
  Parameters:
  
  name - the value for the name record component
  
  centers - the value for the centers record component
  
  distance - the value for the distance record component
  
  group - the value for the group record component
  
  proximity - the value for the proximity record component
  
  size - the value for the size record component
  
  distortions - the value for the distortions record component
Method Details
- k
  
  public int k()
  
  Returns the number of clusters.
  
  Returns:
  
  the number of clusters.
- distortion
  
  public double distortion()
  
  Returns the average squared distance between data points and their respective cluster centers. This is the within-cluster sum of squares (WCSS) divided by the number of data points.
  
  Returns:
  
  the distortion.
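As an illustration of the quantity that distortion() describes, the sketch below computes the total and the average of the per-point squared distances. It assumes the proximity array holds squared distances, as the record component description states. DistortionSketch and its methods are hypothetical names, not Smile API.

```java
/** Hypothetical helper, not part of Smile: overall distortion from proximities. */
final class DistortionSketch {
    /** Sum of squared distances of all points to their cluster centers. */
    static double sumOfSquares(double[] squaredDistances) {
        double sum = 0.0;
        for (double d : squaredDistances) {
            sum += d;
        }
        return sum;
    }

    /** Average squared distance: the sum of squares divided by the number of points. */
    static double averageSquaredDistance(double[] squaredDistances) {
        return sumOfSquares(squaredDistances) / squaredDistances.length;
    }

    public static void main(String[] args) {
        double[] proximity = {1.0, 2.0, 3.0};
        System.out.println(sumOfSquares(proximity));
        System.out.println(averageSquaredDistance(proximity));
    }
}
```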
- compareTo
  
  public int compareTo(CentroidClustering<T,U> o)
  
  Specified by:
  
  compareTo in interface Comparable<CentroidClustering<T,U>>
- toString
  
  public String toString()
  
  Returns a string representation of this record class. The representation contains the name of the class, followed by the name and value of each of the record components.
  
  Specified by:
  
  toString in class Record
  
  Returns:
  
  a string representation of this object
- center
  
  public T center(int i)
  
  Returns the center of i-th cluster.
  
  Parameters:
  
  i - the index of cluster.
  
  Returns:
  
  the cluster center.
- group
  
  public int group(int i)
  
  Returns the cluster label of i-th data point.
  
  Parameters:
  
  i - the index of data point.
  
  Returns:
  
  the cluster label.
- proximity
  
  public double proximity(int i)
  
  Returns the distance of i-th data point to its cluster center.
  
  Parameters:
  
  i - the index of data point.
  
  Returns:
  
  the distance to cluster center.
- size
  
  public int size(int i)
  
  Returns the size of i-th cluster.
  
  Parameters:
  
  i - the index of cluster.
  
  Returns:
  
  the cluster size.
- radius
  
  public double radius(int i)
  
  Returns the radius of i-th cluster.
  
  Parameters:
  
  i - the index of cluster.
  
  Returns:
  
  the cluster radius.
- predict
  
  public int predict(U x)
  
  Classifies a new observation.
  
  Parameters:
  
  x - a new observation.
  
  Returns:
  
  the cluster label.
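The page does not spell out how predict chooses a label. A natural reading, assumed here, is that the observation is assigned to the center with the smallest value of the distance function. NearestCenterPredictor is a hypothetical name, not Smile API; only ToDoubleBiFunction comes from the JDK.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical helper, not part of Smile: nearest-center classification. */
final class NearestCenterPredictor {
    /**
     * Returns the index of the center closest to x under the given distance,
     * or -1 if there are no centers.
     */
    static <T, U> int predict(T[] centers, ToDoubleBiFunction<T, U> distance, U x) {
        int label = -1;
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = distance.applyAsDouble(centers[i], x);
            if (d < best) {
                best = d;
                label = i;
            }
        }
        return label;
    }

    public static void main(String[] args) {
        Double[] centers = {0.0, 10.0};
        ToDoubleBiFunction<Double, Double> distance = (c, x) -> Math.abs(c - x);
        System.out.println(predict(centers, distance, 7.0));
    }
}
```

Keeping the center type T and the observation type U separate mirrors the record's own type parameters, so the same loop would also cover the case where centers and observations have different types.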
- init
  
  public static <T> CentroidClustering<T,T> init(String name, T[] data, int k, ToDoubleBiFunction<T,T> distance)
  
  Returns a random clustering based on the K-Means++ algorithm. Many clustering methods, e.g. k-means, need an initial clustering configuration as a seed.
  
  K-Means++ is based on the intuition of spreading the k initial cluster centers away from each other. The first cluster center is chosen uniformly at random from the data points being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to the squared distance from the point to its closest already-chosen cluster center.
  
  The exact algorithm is as follows:
  
  1. Choose one center uniformly at random from among the data points.
  
  2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
  
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D²(x).
  
  4. Repeat Steps 2 and 3 until k centers have been chosen.
  
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
  
  This seeding method yields considerable improvements in the final error of k-means. Although the initial selection takes extra time, the k-means part itself converges very fast after this seeding, so the algorithm actually lowers the overall computation time too.
  
  D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM Symposium on Discrete Algorithms, 1027-1035, 2007.
  
  Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. "A systematic evaluation of different methods for initializing the K-means clustering algorithm". 2010.
  
  Type Parameters:
  
  T - the type of input object.
  
  Parameters:
  
  name - the clustering algorithm name.
  
  data - data objects array of size n.
  
  k - the number of clusters, i.e. the number of initial centers (medoids) to choose.
  
  distance - the distance function.
  
  Returns:
  
  the initial clustering.
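The K-Means++ seeding steps described for init can be sketched as follows. This is an illustration under stated assumptions rather than Smile's implementation: KMeansPlusPlusSeeding and seed are hypothetical names, the distance function is assumed to return a plain (not squared) distance, and the method returns indices into the data array instead of a clustering.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical helper, not part of Smile: K-Means++ style seeding. */
final class KMeansPlusPlusSeeding {
    /** Returns the indices of k data points chosen as initial centers. */
    static <T> int[] seed(T[] data, int k, ToDoubleBiFunction<T, T> distance, Random rng) {
        int n = data.length;
        if (k < 1 || k > n) {
            throw new IllegalArgumentException("k must be between 1 and data.length");
        }
        int[] centers = new int[k];
        double[] d2 = new double[n]; // D^2(x): squared distance to the nearest chosen center
        Arrays.fill(d2, Double.POSITIVE_INFINITY);

        // Step 1: choose the first center uniformly at random.
        centers[0] = rng.nextInt(n);

        for (int c = 1; c < k; c++) {
            // Step 2: update D^2(x) using the most recently chosen center.
            T last = data[centers[c - 1]];
            double total = 0.0;
            for (int i = 0; i < n; i++) {
                double d = distance.applyAsDouble(data[i], last);
                d2[i] = Math.min(d2[i], d * d);
                total += d2[i];
            }
            // Step 3: sample an index with probability proportional to D^2(x).
            double cutoff = rng.nextDouble() * total;
            double cumulative = 0.0;
            int chosen = n - 1; // fallback for degenerate input or rounding
            for (int i = 0; i < n; i++) {
                cumulative += d2[i];
                if (cumulative > cutoff) {
                    chosen = i;
                    break;
                }
            }
            centers[c] = chosen;
        }
        return centers;
    }

    public static void main(String[] args) {
        Double[] data = {0.0, 1.0, 10.0, 11.0, 20.0, 21.0};
        int[] centers = seed(data, 3, (a, b) -> Math.abs(a - b), new Random(42));
        System.out.println(Arrays.toString(centers));
    }
}
```

Because an already chosen point has D²(x) = 0, it adds nothing to the cumulative weight and is not selected again, except in the degenerate fallback case where every remaining weight is zero.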
- seeds
  
  public static double[][] seeds(double[][] data, int k)
  
  Selects random samples as seeds for various algorithms.
  
  Parameters:
  
  data - samples to select seeds from.
  
  k - the number of seeds.
  
  Returns:
  
  the seeds.
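The page says only that seeds selects random samples. One common way to do that, assumed here for illustration, is to draw k distinct rows with a partial Fisher-Yates shuffle. RandomSeeds is a hypothetical name, and the extra Random parameter is not part of the documented signature.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper, not part of Smile: k distinct random rows as seeds. */
final class RandomSeeds {
    /** Returns copies of k distinct rows of data, chosen uniformly at random. */
    static double[][] seeds(double[][] data, int k, Random rng) {
        int n = data.length;
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("k must be between 0 and data.length");
        }
        int[] index = new int[n];
        for (int i = 0; i < n; i++) {
            index[i] = i;
        }
        double[][] seeds = new double[k][];
        for (int i = 0; i < k; i++) {
            // Swap a random not-yet-used index into position i.
            int j = i + rng.nextInt(n - i);
            int tmp = index[i];
            index[i] = index[j];
            index[j] = tmp;
            seeds[i] = data[index[i]].clone();
        }
        return seeds;
    }

    public static void main(String[] args) {
        double[][] data = {{0}, {1}, {2}, {3}};
        System.out.println(Arrays.deepToString(seeds(data, 2, new Random(7))));
    }
}
```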
- hashCode
  
  public final int hashCode()
  
  Returns a hash code value for this object. The value is derived from the hash code of each of the record components.
  
  Specified by:
  
  hashCode in class Record
  
  Returns:
  
  a hash code value for this object
- equals
  
  public final boolean equals(Object o)
  
  Indicates whether some other object is "equal to" this one. The objects are equal if the other object is of the same class and if all the record components are equal. All components in this record class are compared with Objects::equals(Object,Object).
  
  Specified by:
  
  equals in class Record
  
  Parameters:
  
  o - the object with which to compare
  
  Returns:
  
  true if this object is the same as the o argument; false otherwise.
- name
  
  public String name()
  
  Returns the value of the name record component.
  
  Returns:
  
  the value of the name record component
- centers
  
  public T[] centers()
  
  Returns the value of the centers record component.
  
  Returns:
  
  the value of the centers record component
- distance
  
  public ToDoubleBiFunction<T,U> distance()
  
  Returns the value of the distance record component.
  
  Returns:
  
  the value of the distance record component
- group
  
  public int[] group()
  
  Returns the value of the group record component.
  
  Returns:
  
  the value of the group record component
- proximity
  
  public double[] proximity()
  
  Returns the value of the proximity record component.
  
  Returns:
  
  the value of the proximity record component
- size
  
  public int[] size()
  
  Returns the value of the size record component.
  
  Returns:
  
  the value of the size record component
- distortions
  
  public double[] distortions()
  
  Returns the value of the distortions record component.
  
  Returns:
  
  the value of the distortions record component
