kmeans

fun kmeans(data: Array<DoubleArray>, k: Int, maxIter: Int = 100, tol: Double = 1.0E-4, runs: Int = 16): KMeans

K-Means clustering. The algorithm partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Although finding an exact solution to the k-means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.

However, the k-means algorithm has at least two major theoretic shortcomings:

  • First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.

  • Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal learn.

In this implementation, we use k-means++ which addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive to the optimal k-means solution.

We also use k-d trees to speed up each k-means step as described in the filter algorithm by Kanungo, et al.

K-means is a hard clustering method, i.e. each sample is assigned to a specific cluster. In contrast, soft clustering, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assign samples to different clusters with different probabilities.

====References:====

  • Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE TRANS. PAMI, 2002.

  • D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.

  • Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.

This method runs the algorithm for given times and return the best one with smallest distortion.

Parameters

data

the data set.

k

the number of clusters.

maxIter

the maximum number of iterations for each running.

tol

the tolerance of convergence test.

runs

the number of runs of K-Means algorithm.