public class SIB extends CentroidClustering<double[],SparseArray>
In analogy to KMeans, SIB's update formulas are essentially the same as those of the EM algorithm for estimating a finite Gaussian mixture model, with the regular Euclidean distance replaced by the Kullback-Leibler divergence, which is a better dissimilarity measure for co-occurrence data. However, the common batch updating rule of KMeans (assigning all instances to their nearest centroids and then updating the centroids) does not work in SIB, which has to proceed sequentially (reassigning each instance if a better cluster exists, then immediately updating the affected centroids). This may be because the KL divergence is very sensitive, so the centroids can change significantly in each iteration under the batch updating rule.
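The sequential rule described above can be sketched as follows. This is a minimal illustration, not the code of this class: it uses dense double[] rows instead of SparseArray, the plain KL divergence as the criterion, and hypothetical names (SequentialUpdateSketch, initSums, sequentialPass) chosen only for the example.

```java
import java.util.Arrays;

final class SequentialUpdateSketch {
    /** KL divergence D(p || q) in nats; terms with p[j] == 0 contribute nothing. */
    static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int j = 0; j < p.length; j++) {
            if (p[j] > 0.0) {
                if (q[j] <= 0.0) return Double.POSITIVE_INFINITY;
                d += p[j] * Math.log(p[j] / q[j]);
            }
        }
        return d;
    }

    /** Centroid of a cluster, computed from its running sum and member count. */
    static double[] centroid(double[] sum, int n) {
        double[] c = new double[sum.length];
        if (n > 0) {
            for (int j = 0; j < sum.length; j++) c[j] = sum[j] / n;
        }
        return c;
    }

    /** Accumulates per-cluster sums and sizes from the initial labels y. */
    static double[][] initSums(double[][] data, int[] y, int k, int[] size) {
        double[][] sums = new double[k][data[0].length];
        for (int i = 0; i < data.length; i++) {
            size[y[i]]++;
            for (int j = 0; j < data[i].length; j++) sums[y[i]][j] += data[i][j];
        }
        return sums;
    }

    /** One sequential pass; returns the number of reassigned instances. */
    static int sequentialPass(double[][] data, int[] y, double[][] sums, int[] size) {
        int moved = 0;
        for (int i = 0; i < data.length; i++) {
            int best = y[i];
            double bestDiv = kl(data[i], centroid(sums[best], size[best]));
            for (int c = 0; c < sums.length; c++) {
                if (c == y[i]) continue;
                double d = kl(data[i], centroid(sums[c], size[c]));
                if (d < bestDiv) { bestDiv = d; best = c; }
            }
            if (best != y[i]) {
                // Unlike the batch rule, both affected clusters are updated
                // right away, so the next instance sees the new centroids.
                int old = y[i];
                for (int j = 0; j < data[i].length; j++) {
                    sums[old][j] -= data[i][j];
                    sums[best][j] += data[i][j];
                }
                size[old]--;
                size[best]++;
                y[i] = best;
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        double[][] data = {{0.9, 0.1}, {0.8, 0.2}, {0.1, 0.9}, {0.2, 0.8}};
        int[] y = {0, 1, 1, 0};            // deliberately mixed starting labels
        int[] size = new int[2];
        double[][] sums = initSums(data, y, 2, size);
        while (sequentialPass(data, y, sums, size) > 0) { /* repeat until stable */ }
        System.out.println(Arrays.toString(y));
    }
}
```

A batch variant would compute all new labels against the old centroids before touching the sums; the only structural difference here is that the sums change inside the loop over instances.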
Note that this implementation differs slightly from the original paper, in which a weighted Jensen-Shannon divergence is used as the criterion for assigning a randomly picked sample to a different cluster. In our experience this does not work well in some cases, probably because the weighted JS divergence gives too much weight to clusters, which are much larger than a single sample. In this implementation, we instead use the regular (unweighted) Jensen-Shannon divergence.
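The difference between the two criteria can be illustrated with a small sketch. The class and method names are hypothetical, the rows are dense arrays, and the size-proportional weights (1/(n+1) for the sample, n/(n+1) for a cluster of n members) are an assumption made for illustration rather than the exact weighting of the paper or of this class.

```java
final class JensenShannonSketch {
    /** KL divergence D(p || q) in nats; terms with p[j] == 0 contribute nothing. */
    static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int j = 0; j < p.length; j++) {
            if (p[j] > 0.0) d += p[j] * Math.log(p[j] / q[j]);
        }
        return d;
    }

    /**
     * Weighted JS divergence: wp * D(p || m) + wq * D(q || m),
     * where m = wp * p + wq * q and wq = 1 - wp, with 0 < wp < 1.
     */
    static double js(double[] p, double[] q, double wp) {
        double wq = 1.0 - wp;
        double[] m = new double[p.length];
        for (int j = 0; j < p.length; j++) m[j] = wp * p[j] + wq * q[j];
        return wp * kl(p, m) + wq * kl(q, m);
    }

    /** Regular (unweighted) JS divergence: both arguments get weight 1/2. */
    static double js(double[] p, double[] q) {
        return js(p, q, 0.5);
    }

    public static void main(String[] args) {
        double[] sample = {0.7, 0.2, 0.1};
        double[] centroid = {0.1, 0.3, 0.6};
        System.out.println("unweighted: " + js(sample, centroid));
        // Assumed weighting: the sample counts as 1 member, the cluster as n.
        for (int n : new int[]{1, 10, 100, 1000}) {
            double wp = 1.0 / (n + 1);
            System.out.println("n=" + n + " weighted: " + js(sample, centroid, wp));
        }
    }
}
```

Under this assumed weighting, the mixture m moves toward the centroid as n grows, so the sample's own distribution contributes less and less to the value, which is one way to read the remark above about large clusters receiving too much weight.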
centroids, distortion
k, OUTLIER, size, y
Constructor and Description 

SIB(double distortion, double[][] centroids, int[] y)
Constructor.

Modifier and Type  Method and Description 

double  distance(double[] x, SparseArray y)
The distance function.

static SIB  fit(SparseArray[] data, int k)
Clusters data into k clusters, running up to 100 iterations.

static SIB  fit(SparseArray[] data, int k, int maxIter)
Clusters data into k clusters.

compareTo, predict, toString
run, seed
public double distance(double[] x, SparseArray y)
distance in class CentroidClustering<double[],SparseArray>
public static SIB fit(SparseArray[] data, int k)
data - the sparse normalized co-occurrence dataset, in which each row is an observation whose entries sum to 1.
k - the number of clusters.

public static SIB fit(SparseArray[] data, int k, int maxIter)
data - the sparse normalized co-occurrence dataset, in which each row is an observation whose entries sum to 1.
k - the number of clusters.
maxIter - the maximum number of iterations.
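Because fit expects every row to sum to 1, raw co-occurrence counts need to be normalized first. The following is a minimal sketch of that step on plain double[][] arrays; the class and method names are hypothetical, and converting the result to SparseArray rows is left out to keep the example self-contained.

```java
final class RowNormalizeSketch {
    /**
     * Returns a copy of counts in which each row is divided by its sum,
     * so that every row sums to 1. Rows without positive mass are rejected
     * because they cannot be turned into a distribution.
     */
    static double[][] normalizeRows(double[][] counts) {
        double[][] out = new double[counts.length][];
        for (int i = 0; i < counts.length; i++) {
            double sum = 0.0;
            for (double v : counts[i]) sum += v;
            if (sum <= 0.0) {
                throw new IllegalArgumentException("row " + i + " has no counts");
            }
            out[i] = new double[counts[i].length];
            for (int j = 0; j < counts[i].length; j++) {
                out[i][j] = counts[i][j] / sum;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] counts = {{2, 0, 2}, {1, 3, 0}};
        for (double[] row : normalizeRows(counts)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Zero entries stay zero after normalization, so the sparsity pattern of the counts is preserved.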