public class KMeans extends CentroidClustering<double[],double[]>
K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model-based clustering, and Lloyd's algorithm as a variation of the EM algorithm.
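The Voronoi property and the link to nearest neighbor classification can be made concrete with one iteration of Lloyd's algorithm. The following is a minimal sketch in plain Java, not this library's implementation: the class name LloydSketch and its methods are hypothetical, and it assumes squared Euclidean distance and that an empty cluster simply keeps its old centroid.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical and not part of the KMeans API. */
final class LloydSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to x, i.e. the Voronoi cell that contains x. */
    static int nearest(double[] x, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = squaredDistance(x, centroids[c]);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** One Lloyd iteration: assign each row to its nearest centroid, then update centroids in place. */
    static int[] step(double[][] data, double[][] centroids) {
        int k = centroids.length;
        int dim = data[0].length;
        int[] y = new int[data.length];
        double[][] sums = new double[k][dim];
        int[] size = new int[k];
        for (int i = 0; i < data.length; i++) {
            y[i] = nearest(data[i], centroids);
            size[y[i]]++;
            for (int j = 0; j < dim; j++) sums[y[i]][j] += data[i][j];
        }
        for (int c = 0; c < k; c++) {
            if (size[c] == 0) continue; // assumption: an empty cluster keeps its old centroid
            for (int j = 0; j < dim; j++) centroids[c][j] = sums[c][j] / size[c];
        }
        return y;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] centroids = {{1, 1}, {9, 9}};
        int[] y = step(data, centroids);
        System.out.println(Arrays.toString(y));             // prints [0, 0, 1, 1]
        System.out.println(Arrays.deepToString(centroids)); // prints [[0.0, 1.0], [10.0, 11.0]]
    }
}
```

The assignment half of the step is a nearest neighbor lookup against the centroids, and the update half plays the role of the M-step in the EM analogy.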
However, the k-means algorithm has at least two major theoretical shortcomings.
We also use k-d trees to speed up each k-means step, as described in the filtering algorithm by Kanungo et al.
K-means is a hard clustering method, i.e. each observation is assigned to exactly one cluster. In contrast, soft clustering, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assigns each observation to several clusters with different probabilities.
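The difference between hard and soft assignment can be illustrated for a single observation. This is a rough sketch, not part of this class: AssignmentSketch and its methods are hypothetical names, and the soft variant is a simplified E-step that assumes equal mixing weights and one shared isotropic variance for all components.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical and not part of the KMeans API. */
final class AssignmentSketch {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Hard assignment: the single index of the closest centroid. */
    static int hardAssign(double[] x, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = squaredDistance(x, centroids[c]);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /**
     * Soft assignment: one probability per component, proportional to
     * exp(-||x - mean||^2 / (2 * variance)). Assumes equal mixing weights
     * and a shared isotropic variance.
     */
    static double[] softAssign(double[] x, double[][] means, double variance) {
        double[] d = new double[means.length];
        double min = Double.POSITIVE_INFINITY;
        for (int c = 0; c < means.length; c++) {
            d[c] = squaredDistance(x, means[c]);
            min = Math.min(min, d[c]);
        }
        double[] p = new double[means.length];
        double total = 0.0;
        for (int c = 0; c < means.length; c++) {
            // Subtracting the minimum distance avoids underflow and cancels on normalization.
            p[c] = Math.exp(-(d[c] - min) / (2.0 * variance));
            total += p[c];
        }
        for (int c = 0; c < means.length; c++) p[c] /= total;
        return p;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {2, 0}};
        double[] x = {0.5, 0};
        System.out.println(hardAssign(x, centers));
        System.out.println(Arrays.toString(softAssign(x, centers, 1.0)));
    }
}
```

The hard assignment returns one label, while the soft assignment returns a probability vector that sums to one and approaches the hard label as the variance shrinks.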
centroids, distortion
k, OUTLIER, size, y
Constructor and Description 

KMeans(double distortion,
double[][] centroids,
int[] y)
Constructor.

Modifier and Type  Method and Description 

double 
distance(double[] x,
double[] y)
The distance function.

static KMeans 
fit(BBDTree bbd,
double[][] data,
int k,
int maxIter,
double tol)
Partitions data into k clusters.

static KMeans 
fit(double[][] data,
int k)
Partitions data into k clusters, running up to 100 iterations.

static KMeans 
fit(double[][] data,
int k,
int maxIter,
double tol)
Partitions data into k clusters, running up to maxIter iterations.

static KMeans 
lloyd(double[][] data,
int k)
An implementation of Lloyd's algorithm, provided as a benchmark.

static KMeans 
lloyd(double[][] data,
int k,
int maxIter,
double tol)
An implementation of Lloyd's algorithm, provided as a benchmark.

compareTo, predict, toString
run, seed
public KMeans(double distortion, double[][] centroids, int[] y)
distortion - the total distortion.
centroids - the centroids of each cluster.
y - the cluster labels.

public double distance(double[] x, double[] y)
The distance function.
Specified by: distance in class CentroidClustering<double[],double[]>

public static KMeans fit(double[][] data, int k)
data - the input data of which each row is an observation.
k - the number of clusters.

public static KMeans fit(double[][] data, int k, int maxIter, double tol)
data - the input data of which each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of convergence test.

public static KMeans fit(BBDTree bbd, double[][] data, int k, int maxIter, double tol)
bbd - the BBDtree of data for fast clustering.
data - the input data of which each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of convergence test.

public static KMeans lloyd(double[][] data, int k)
data - the input data of which each row is an observation.
k - the number of clusters.

public static KMeans lloyd(double[][] data, int k, int maxIter, double tol)
data - the input data of which each row is an observation.
k - the number of clusters.
maxIter - the maximum number of iterations.
tol - the tolerance of convergence test.
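
How maxIter and tol might interact can be sketched as a stopping rule. This page does not state how tol is applied, so the sketch below assumes it bounds the decrease in distortion (the sum of squared distances to the assigned centroids) between consecutive passes. ConvergenceSketch, pass and run are hypothetical names, not part of this class.

```java
/** Illustrative sketch only; names are hypothetical and not part of the KMeans API. */
final class ConvergenceSketch {
    /**
     * One assign-and-update pass. Writes labels into y, updates centroids in place,
     * and returns the distortion measured against the centroids as they were
     * before the update.
     */
    static double pass(double[][] data, double[][] centroids, int[] y) {
        int k = centroids.length;
        int dim = data[0].length;
        double[][] sums = new double[k][dim];
        int[] size = new int[k];
        double distortion = 0.0;
        for (int i = 0; i < data.length; i++) {
            double best = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double d = 0.0;
                for (int j = 0; j < dim; j++) {
                    double diff = data[i][j] - centroids[c][j];
                    d += diff * diff;
                }
                if (d < best) {
                    best = d;
                    y[i] = c;
                }
            }
            distortion += best;
            size[y[i]]++;
            for (int j = 0; j < dim; j++) sums[y[i]][j] += data[i][j];
        }
        for (int c = 0; c < k; c++) {
            if (size[c] == 0) continue; // assumption: an empty cluster keeps its old centroid
            for (int j = 0; j < dim; j++) centroids[c][j] = sums[c][j] / size[c];
        }
        return distortion;
    }

    /** Repeats passes until the distortion drops by at most tol or maxIter is reached; returns the pass count. */
    static int run(double[][] data, double[][] centroids, int[] y, int maxIter, double tol) {
        double previous = Double.MAX_VALUE;
        int iter = 0;
        while (iter < maxIter) {
            double current = pass(data, centroids, y);
            iter++;
            if (previous - current <= tol) break; // assumed convergence test
            previous = current;
        }
        return iter;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] centroids = {{1, 1}, {9, 9}};
        int[] y = new int[data.length];
        System.out.println(run(data, centroids, y, 100, 1e-4)); // prints 3
    }
}
```

With a small maxIter the loop stops on the iteration cap before the tolerance test is met, so both parameters bound the work in this sketch.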