Class KMeans
K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest-neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model-based clustering, and Lloyd's algorithm as a variation of the EM algorithm.
However, the k-means algorithm has at least two major theoretic shortcomings:
- First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.
- Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering. Therefore, it is common to run the algorithm multiple times with different random initializations.
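Because the result depends on the random initialization, careful seeding helps in practice. The k-means++ scheme from the Arthur and Vassilvitskii reference below chooses each new centroid with probability proportional to its squared distance from the nearest centroid already chosen. The following is a minimal, self-contained sketch of that seeding step; the class and method names are illustrative, not part of the documented API:

```java
import java.util.Random;

public class KMeansPlusPlus {
    /** Squared Euclidean distance between two points. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Picks k initial centroids with k-means++ seeding: the first centroid
     * is uniform at random; each subsequent one is sampled with probability
     * proportional to its squared distance from the nearest chosen centroid.
     */
    static double[][] seed(double[][] data, int k, Random rng) {
        double[][] centroids = new double[k][];
        centroids[0] = data[rng.nextInt(data.length)].clone();

        // dist[i] tracks the squared distance from point i to its nearest centroid so far.
        double[] dist = new double[data.length];
        java.util.Arrays.fill(dist, Double.POSITIVE_INFINITY);

        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (int i = 0; i < data.length; i++) {
                dist[i] = Math.min(dist[i], squaredDistance(data[i], centroids[c - 1]));
                total += dist[i];
            }
            // Sample an index proportionally to dist[] via the cumulative sum.
            double r = rng.nextDouble() * total;
            int next = data.length - 1;
            double cumulative = 0.0;
            for (int i = 0; i < data.length; i++) {
                cumulative += dist[i];
                if (cumulative >= r) { next = i; break; }
            }
            centroids[c] = data[next].clone();
        }
        return centroids;
    }
}
```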
We also use k-d trees to speed up each k-means step, as described in the filtering algorithm by Kanungo et al.
K-means is a hard clustering method, i.e. each observation is assigned to exactly one cluster. In contrast, soft clustering methods, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assign each observation to every cluster with some probability.
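To make the hard-assignment loop concrete, here is a self-contained sketch of a single Lloyd iteration: assign each observation to its nearest centroid, then recompute each centroid as the mean of its assigned points. This is an illustrative standalone class, not the library's implementation:

```java
public class LloydStep {
    /** Index of the centroid nearest to x under squared Euclidean distance. */
    static int nearest(double[] x, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centroids.length; j++) {
            double sum = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - centroids[j][d];
                sum += diff * diff;
            }
            if (sum < bestDist) { bestDist = sum; best = j; }
        }
        return best;
    }

    /**
     * One Lloyd iteration: hard-assign each point to its nearest centroid,
     * then move each centroid to the mean of its assigned points.
     */
    static double[][] step(double[][] data, double[][] centroids) {
        int k = centroids.length, dim = data[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] x : data) {
            int j = nearest(x, centroids);
            counts[j]++;
            for (int d = 0; d < dim; d++) sums[j][d] += x[d];
        }
        for (int j = 0; j < k; j++) {
            if (counts[j] == 0) continue; // keep an empty cluster's old centroid
            for (int d = 0; d < dim; d++) sums[j][d] /= counts[j];
            centroids[j] = sums[j];
        }
        return centroids;
    }
}
```

Iterating this step until the assignments stop changing is exactly Lloyd's algorithm; the BBD-tree variant below accelerates the assignment phase rather than changing its result.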
References
- Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. "An Efficient k-Means Clustering Algorithm: Analysis and Implementation." IEEE Trans. PAMI, 2002.
- D. Arthur and S. Vassilvitskii. "k-means++: The Advantages of Careful Seeding." ACM-SIAM Symposium on Discrete Algorithms, 1027-1035, 2007.
- Anna D. Peterson, Arka P. Ghosh, and Ranjan Maitra. "A Systematic Evaluation of Different Methods for Initializing the K-means Clustering Algorithm." 2010.
Method Summary
- static CentroidClustering<double[], double[]> fit(double[][] data, int k, int maxIter) - Fits k-means clustering.
- static CentroidClustering<double[], double[]> fit(double[][] data, Clustering.Options options) - Fits k-means clustering.
- static CentroidClustering<double[], double[]> fit(BBDTree bbd, double[][] data, Clustering.Options options) - Partitions data into k clusters.
- static CentroidClustering<double[], double[]> lloyd(double[][] data, int k, int maxIter) - Fits k-means clustering on data containing missing values (NaN).
- static CentroidClustering<double[], double[]> lloyd(double[][] data, Clustering.Options options) - Fits k-means clustering on data containing missing values (NaN).
Method Details
fit
public static CentroidClustering<double[], double[]> fit(double[][] data, int k, int maxIter)
Fits k-means clustering.
- Parameters:
  - data - the input data of which each row is an observation.
  - k - the number of clusters.
  - maxIter - the maximum number of iterations.
- Returns:
  - the model.
fit
public static CentroidClustering<double[], double[]> fit(double[][] data, Clustering.Options options)
Fits k-means clustering.
- Parameters:
  - data - the input data of which each row is an observation.
  - options - the hyperparameters.
- Returns:
  - the model.
fit
public static CentroidClustering<double[], double[]> fit(BBDTree bbd, double[][] data, Clustering.Options options)
Partitions data into k clusters.
- Parameters:
  - bbd - the BBD-tree of data for fast clustering.
  - data - the input data of which each row is an observation.
  - options - the hyperparameters.
- Returns:
  - the model.
lloyd
public static CentroidClustering<double[], double[]> lloyd(double[][] data, int k, int maxIter)
Fits k-means clustering on data containing missing values (NaN).
- Parameters:
  - data - the input data of which each row is an observation.
  - k - the number of clusters.
  - maxIter - the maximum number of iterations.
- Returns:
  - the model.
lloyd
public static CentroidClustering<double[], double[]> lloyd(double[][] data, Clustering.Options options)
Fits k-means clustering on data containing missing values (NaN).
- Parameters:
  - data - the input data of which each row is an observation.
  - options - the hyperparameters.
- Returns:
  - the model.
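The lloyd variants above accept observations with missing values (NaN). One plausible way to update a centroid in that setting, shown here as an assumption rather than the library's actual implementation, is to average each coordinate over only the non-missing entries of the cluster's points:

```java
public class MissingValueMean {
    /**
     * Per-dimension mean of a cluster's points, skipping NaN entries.
     * A coordinate with no observed value in any point stays NaN.
     * Sketch of one way a centroid update can tolerate missing values.
     */
    static double[] nanMean(double[][] points) {
        int dim = points[0].length;
        double[] mean = new double[dim];
        for (int d = 0; d < dim; d++) {
            double sum = 0.0;
            int count = 0;
            for (double[] p : points) {
                if (!Double.isNaN(p[d])) { sum += p[d]; count++; }
            }
            mean[d] = count > 0 ? sum / count : Double.NaN;
        }
        return mean;
    }
}
```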