Class KMeans
All Implemented Interfaces:
Serializable, Comparable<CentroidClustering<double[], double[]>>
K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest-neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model-based clustering, and Lloyd's algorithm as a variation of the EM algorithm.
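The connection to nearest-neighbor classification can be made concrete: predicting the cluster of a new observation amounts to a 1-NN search over the centroids, which is exactly what induces the Voronoi partition. A minimal sketch in plain Java (independent of the Smile API; the centroid values are made up for illustration):

```java
public class NearestCentroid {
    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return sum;
    }

    // Assign x to the cluster of its nearest centroid (1-NN over centroids).
    static int predict(double[][] centroids, double[] x) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centroids.length; j++) {
            double d = squaredDistance(centroids[j], x);
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        System.out.println(predict(centroids, new double[]{1.0, 2.0}));  // prints 0
        System.out.println(predict(centroids, new double[]{9.0, 8.0}));  // prints 1
    }
}
```

Every point assigned to centroid j lies in centroid j's Voronoi cell, which is why the learned partition is a Voronoi diagram over the centroids.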
However, the k-means algorithm has at least two major theoretic shortcomings:
 First, it has been shown that the worst-case running time of the algorithm is superpolynomial in the input size.
 Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering. Therefore, it is common to run the algorithm multiple times with different random initializations.
We also use k-d trees to speed up each k-means step, as described in the filtering algorithm by Kanungo et al.
K-means is a hard clustering method, i.e. each observation is assigned to a specific cluster. In contrast, soft clustering methods, e.g. the expectation-maximization algorithm for Gaussian mixtures, assign observations to different clusters with different probabilities.
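The hard assignment and the alternating structure of Lloyd's algorithm can be sketched in a few dozen lines. This is an illustrative, self-contained version only, with naive random seeding; it is not Smile's optimized filtering implementation, which uses k-d trees and careful seeding:

```java
import java.util.Random;

public class LloydSketch {
    static double dist2(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) { double t = x[i] - y[i]; s += t * t; }
        return s;
    }

    // One Lloyd run: returns the hard cluster label of each observation.
    static int[] lloyd(double[][] data, int k, int maxIter, long seed) {
        Random rng = new Random(seed);
        int n = data.length, d = data[0].length;
        double[][] centroids = new double[k][];
        // Naive seeding: pick k observations at random (Smile uses smarter seeding).
        for (int j = 0; j < k; j++) centroids[j] = data[rng.nextInt(n)].clone();
        int[] label = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: each point goes to its nearest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist2(data[i], centroids[j]) < dist2(data[i], centroids[best])) best = j;
                if (label[i] != best) { label[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break;  // no label moved: converged
            // Update step: each centroid moves to the mean of its assigned points.
            double[][] sum = new double[k][d];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                count[label[i]]++;
                for (int c = 0; c < d; c++) sum[label[i]][c] += data[i][c];
            }
            for (int j = 0; j < k; j++)
                if (count[j] > 0)
                    for (int c = 0; c < d; c++) centroids[j][c] = sum[j][c] / count[j];
        }
        return label;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}};
        int[] y = lloyd(data, 2, 100, 42);
        // With two well-separated groups, each group shares one label.
        System.out.println(y[0] == y[1] && y[1] == y[2] && y[0] != y[3]);
    }
}
```

Because the result depends on the random seeds, the usual practice mentioned above is to repeat such a run several times and keep the clustering with the lowest total distortion.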
References
 Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Trans. PAMI, 2002.
 D. Arthur and S. Vassilvitskii. "k-means++: the advantages of careful seeding". ACM-SIAM Symposium on Discrete Algorithms, 1027-1035, 2007.
 Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.
Field Summary
Fields inherited from class smile.clustering.CentroidClustering
centroids, distortion
Fields inherited from class smile.clustering.PartitionClustering
k, OUTLIER, size, y

Constructor Summary
KMeans(double distortion, double[][] centroids, int[] y)
Constructor.

Method Summary
Modifier and Type | Method | Description
protected double | distance(double[] x, double[] y) | The distance function.
static KMeans | fit(double[][] data, int k) | Partitions data into k clusters up to 100 iterations.
static KMeans | fit(double[][] data, int k, int maxIter, double tol) | Partitions data into k clusters.
static KMeans | fit(BBDTree bbd, double[][] data, int k, int maxIter, double tol) | Partitions data into k clusters.
static KMeans | lloyd(double[][] data, int k) | The implementation of the Lloyd algorithm as a benchmark.
static KMeans | lloyd(double[][] data, int k, int maxIter, double tol) | The implementation of the Lloyd algorithm as a benchmark.
Methods inherited from class smile.clustering.CentroidClustering
compareTo, predict, toString
Methods inherited from class smile.clustering.PartitionClustering
run, seed

Constructor Details

KMeans
public KMeans(double distortion, double[][] centroids, int[] y)
Constructor.
Parameters:
distortion - the total distortion.
centroids - the centroids of each cluster.
y - the cluster labels.


Method Details

distance
protected double distance(double[] x, double[] y)
Description copied from class: CentroidClustering
The distance function.
Specified by:
distance in class CentroidClustering<double[], double[]>
Parameters:
x - an observation.
y - the other observation.
Returns:
the distance.

fit
public static KMeans fit(double[][] data, int k)
Partitions data into k clusters up to 100 iterations.
Parameters:
data - the input data of which each row is an observation.
k - the number of clusters.
Returns:
the model.

fit
public static KMeans fit(double[][] data, int k, int maxIter, double tol)
Partitions data into k clusters.
Parameters:
data - the input data of which each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of convergence test.
Returns:
the model.
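The maxIter and tol parameters bound the iteration: the loop stops at maxIter passes or earlier, once the convergence test is met. One plausible reading of such a tolerance test, shown here purely as an assumption (the source does not specify Smile's exact criterion), is that iteration halts when the relative distortion improvement falls below tol:

```java
public class ConvergenceCheck {
    // Hypothetical stopping rule (an assumption, not necessarily Smile's exact
    // test): stop when the relative improvement in distortion drops below tol.
    static boolean converged(double prevDistortion, double distortion, double tol) {
        return prevDistortion - distortion <= tol * prevDistortion;
    }

    public static void main(String[] args) {
        System.out.println(converged(100.0, 99.999, 1E-4));  // tiny improvement: stop
        System.out.println(converged(100.0, 50.0, 1E-4));    // big improvement: keep iterating
    }
}
```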

fit
public static KMeans fit(BBDTree bbd, double[][] data, int k, int maxIter, double tol)
Partitions data into k clusters.
Parameters:
bbd - the BBD-tree of the data for fast clustering.
data - the input data of which each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of convergence test.
Returns:
the model.

lloyd
public static KMeans lloyd(double[][] data, int k)
The implementation of the Lloyd algorithm as a benchmark. The data may contain missing values (i.e. Double.NaN). The algorithm runs up to 100 iterations.
Parameters:
data - the input data of which each row is an observation.
k - the number of clusters.
Returns:
the model.

lloyd
public static KMeans lloyd(double[][] data, int k, int maxIter, double tol)
The implementation of the Lloyd algorithm as a benchmark. The data may contain missing values (i.e. Double.NaN).
Parameters:
data - the input data of which each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of convergence test.
Returns:
the model.
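The note that lloyd tolerates missing values suggests that centroid coordinates are computed over observed entries only. The source does not spell out the mechanism, so the following per-dimension NaN-skipping mean is an assumption, sketched for illustration:

```java
public class NaNMean {
    // Per-dimension mean that skips missing entries (Double.NaN).
    // Hypothetical handling; Smile's actual lloyd implementation may differ.
    static double[] mean(double[][] points) {
        int d = points[0].length;
        double[] sum = new double[d];
        int[] count = new int[d];
        for (double[] p : points) {
            for (int j = 0; j < d; j++) {
                if (!Double.isNaN(p[j])) { sum[j] += p[j]; count[j]++; }
            }
        }
        // A dimension with no observed values stays NaN rather than 0.
        for (int j = 0; j < d; j++) sum[j] = count[j] > 0 ? sum[j] / count[j] : Double.NaN;
        return sum;
    }

    public static void main(String[] args) {
        double[][] pts = {{1.0, 2.0}, {3.0, Double.NaN}};
        double[] m = mean(pts);
        System.out.println(m[0] + " " + m[1]);  // prints: 2.0 2.0
    }
}
```

Under this reading, a NaN entry simply drops out of its dimension's average instead of poisoning the centroid, which is why lloyd can run on data with missing values while the k-d-tree-accelerated fit cannot.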
