Class SIB

All Implemented Interfaces:
Serializable, Comparable<CentroidClustering<double[], SparseArray>>
In analogy to K-Means, SIB's update formulas are essentially the same as the EM algorithm for estimating a finite Gaussian mixture model, with the regular Euclidean distance replaced by the Kullback-Leibler divergence, which is clearly a better dissimilarity measure for co-occurrence data. However, the common batch updating rule of K-Means (assigning all instances to the nearest centroids and then updating the centroids) does not work in SIB, which has to work sequentially: each instance is reassigned if a better cluster is found, and the related centroids are updated immediately. This is likely because the KL divergence is very sensitive and the centroids may change significantly in each iteration under the batch updating rule.
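The sequential rule described above can be sketched in plain Java. This is a minimal standalone illustration, not Smile's actual implementation: the class and method names are invented, KL divergence stands in for the dissimilarity measure, and centroids are maintained as running means of the assigned rows.

```java
import java.util.Arrays;

public class SequentialUpdateDemo {
    /** Kullback-Leibler divergence D(p || q); zero p-terms contribute nothing. */
    static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) d += p[i] * Math.log(p[i] / q[i]);
        }
        return d;
    }

    /** One sequential pass: reassign each sample to its nearest centroid (by KL)
        and update the two affected centroid means immediately, so later samples
        in the same pass already see the change. A batch rule would instead
        update all centroids only after the full pass. */
    static int[] sequentialPass(double[][] data, double[][] centroids, int[] y, int[] size) {
        for (int i = 0; i < data.length; i++) {
            int best = 0;
            double bestD = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double d = kl(data[i], centroids[c]);
                if (d < bestD) { bestD = d; best = c; }
            }
            int old = y[i];
            if (best != old && size[old] > 1) {  // never empty a cluster
                for (int j = 0; j < data[i].length; j++) {
                    centroids[old][j] = (centroids[old][j] * size[old] - data[i][j]) / (size[old] - 1);
                    centroids[best][j] = (centroids[best][j] * size[best] + data[i][j]) / (size[best] + 1);
                }
                size[old]--;
                size[best]++;
                y[i] = best;
            }
        }
        return y;
    }

    public static void main(String[] args) {
        // Toy normalized rows (each sums to 1) and two centroid distributions.
        double[][] data = {
            {0.9, 0.1, 0.0}, {0.8, 0.2, 0.0},
            {0.0, 0.1, 0.9}, {0.0, 0.2, 0.8}
        };
        double[][] centroids = {{0.5, 0.3, 0.2}, {0.2, 0.3, 0.5}};
        int[] y = sequentialPass(data, centroids, new int[]{0, 0, 1, 1}, new int[]{2, 2});
        System.out.println(Arrays.toString(y));  // [0, 0, 1, 1]: already a stable assignment
    }
}
```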
Note that this implementation differs slightly from the original paper, in which a weighted Jensen-Shannon divergence is employed as the criterion to assign a randomly picked sample to a different cluster. However, in our experience this does not work well in some cases, probably because the weighted JS divergence gives too much weight to the clusters, which are much larger than a single sample. In this implementation, we instead use the regular unweighted Jensen-Shannon divergence.
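The unweighted Jensen-Shannon divergence mentioned above can be written in a few lines. This is a standalone textbook sketch (class and method names are invented for illustration); Smile's internal implementation may differ in details.

```java
public class JSDemo {
    /** Kullback-Leibler divergence D(p || q); zero p-terms contribute nothing. */
    static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) d += p[i] * Math.log(p[i] / q[i]);
        }
        return d;
    }

    /** Unweighted Jensen-Shannon divergence:
        JS(p, q) = (D(p || m) + D(q || m)) / 2, where m = (p + q) / 2.
        Unlike KL, it is symmetric, always finite, and bounded by log 2.
        The weighted variant of the original paper replaces the two 1/2
        factors with weights proportional to cluster and sample mass. */
    static double js(double[] p, double[] q) {
        double[] m = new double[p.length];
        for (int i = 0; i < p.length; i++) m[i] = 0.5 * (p[i] + q[i]);
        return 0.5 * kl(p, m) + 0.5 * kl(q, m);
    }

    public static void main(String[] args) {
        double[] p = {1.0, 0.0};
        double[] q = {0.0, 1.0};
        System.out.println(js(p, q));  // log 2 for disjoint supports
        System.out.println(js(p, p));  // 0.0 for identical distributions
    }
}
```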
References
 N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. 1999.
 N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. ACM SIGIR, 2002.
 Jaakko Peltonen, Janne Sinkkonen, and Samuel Kaski. Sequential information bottleneck for finite data. ICML, 2004.
Field Summary
Fields inherited from class smile.clustering.CentroidClustering
centroids, distortion
Fields inherited from class smile.clustering.PartitionClustering
k, OUTLIER, size, y

Constructor Summary

SIB(double distortion, double[][] centroids, int[] y) - Constructor.
Method Summary
Modifier and Type	Method	Description
protected double	distance(double[] x, SparseArray y)	The distance function.
static SIB	fit(SparseArray[] data, int k)	Clustering data into k clusters up to 100 iterations.
static SIB	fit(SparseArray[] data, int k, int maxIter)	Clustering data into k clusters.

Methods inherited from class smile.clustering.CentroidClustering
compareTo, predict, toString
Methods inherited from class smile.clustering.PartitionClustering
run, seed

Constructor Details

SIB
public SIB(double distortion, double[][] centroids, int[] y)
Constructor.
Parameters:
distortion - the total distortion.
centroids - the centroids of each cluster.
y - the cluster labels.


Method Details

distance
protected double distance(double[] x, SparseArray y)
Description copied from class: CentroidClustering
The distance function.
Specified by:
distance in class CentroidClustering<double[], SparseArray>
Parameters:
x - an observation.
y - the other observation.
Returns:
the distance.

fit
public static SIB fit(SparseArray[] data, int k)
Clustering data into k clusters up to 100 iterations.
Parameters:
data - the sparse normalized co-occurrence dataset, each row of which is an observation whose entries sum to 1.
k - the number of clusters.
Returns:
the model.

fit
public static SIB fit(SparseArray[] data, int k, int maxIter)
Clustering data into k clusters.
Parameters:
data - the sparse normalized co-occurrence dataset, each row of which is an observation whose entries sum to 1.
k - the number of clusters.
maxIter - the maximum number of iterations.
Returns:
the model.
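Both fit overloads require each row of data to be a normalized co-occurrence vector summing to 1. A small self-contained sketch of preparing such a row from raw counts (the class and helper names below are illustrative, not part of Smile's API):

```java
import java.util.Arrays;

public class NormalizeDemo {
    /** Convert a raw co-occurrence count vector into a probability
        distribution, as required by SIB.fit: each row must sum to 1. */
    static double[] normalize(int[] counts) {
        double total = 0.0;
        for (int c : counts) total += c;
        double[] p = new double[counts.length];
        for (int i = 0; i < counts.length; i++) p[i] = counts[i] / total;
        return p;
    }

    public static void main(String[] args) {
        int[] counts = {3, 1, 0, 4};  // e.g. word counts for one document
        double[] row = normalize(counts);
        System.out.println(Arrays.toString(row));  // [0.375, 0.125, 0.0, 0.5]
    }
}
```

In practice each normalized row would then be stored as a SparseArray (keeping only the nonzero entries) before being passed to fit.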
