Class SIB
In analogy to K-Means, SIB's update formulas are essentially the same as those of the EM algorithm for estimating a finite Gaussian mixture model, with the regular Euclidean distance replaced by the Kullback-Leibler divergence, which is a better dissimilarity measure for co-occurrence data. However, the common batch update rule of K-Means (assign all instances to their nearest centroids, then update the centroids) does not work for SIB, which has to proceed sequentially: each instance is reassigned (if a better cluster is found) and the affected centroids are updated immediately. A likely reason is that the K-L divergence is very sensitive, so the centroids may change significantly within each iteration under the batch update rule.
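As a rough illustration of that sequential update pattern, here is a self-contained toy sketch on dense probability vectors. It is not this class's implementation (SIB works on SparseArray data and uses a Jensen-Shannon-based assignment criterion); a plain KL divergence to the centroid is used only to show the control flow, and all names are illustrative.

import java.util.Arrays;
import java.util.Random;

public class SequentialUpdateSketch {
    // KL divergence between two discrete distributions.
    static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                if (q[i] <= 0.0) return Double.POSITIVE_INFINITY;
                d += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return d;
    }

    // Recompute the centroid of one cluster as the mean of its members.
    static void updateCentroid(double[][] data, int[] label, int cluster, double[] centroid) {
        Arrays.fill(centroid, 0.0);
        int size = 0;
        for (int i = 0; i < data.length; i++) {
            if (label[i] == cluster) {
                for (int j = 0; j < centroid.length; j++) centroid[j] += data[i][j];
                size++;
            }
        }
        if (size > 0) for (int j = 0; j < centroid.length; j++) centroid[j] /= size;
    }

    static int[] cluster(double[][] data, int k, int maxIter, long seed) {
        int n = data.length, dim = data[0].length;
        Random rng = new Random(seed);
        int[] label = new int[n];
        double[][] centroid = new double[k][dim];
        for (int i = 0; i < n; i++) label[i] = rng.nextInt(k); // random initial assignment
        for (int c = 0; c < k; c++) updateCentroid(data, label, c, centroid[c]);

        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                // Sequential rule: pick the best cluster for this single instance ...
                int best = label[i];
                double bestCost = kl(data[i], centroid[best]);
                for (int c = 0; c < k; c++) {
                    double cost = kl(data[i], centroid[c]);
                    if (cost < bestCost) { bestCost = cost; best = c; }
                }
                // ... and, if it moves, update the two affected centroids immediately,
                // rather than waiting for a batch pass over all instances.
                if (best != label[i]) {
                    int old = label[i];
                    label[i] = best;
                    updateCentroid(data, label, old, centroid[old]);
                    updateCentroid(data, label, best, centroid[best]);
                    changed = true;
                }
            }
            if (!changed) break; // converged: no instance changed its cluster
        }
        return label;
    }
}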
Note that this implementation differs slightly from the original paper, in which a weighted Jensen-Shannon divergence is the criterion for assigning a randomly picked sample to a different cluster. In our experience this does not work well in some cases, probably because the weighted JS divergence gives too much weight to the clusters, which are much larger than a single sample. This implementation therefore uses the regular, unweighted Jensen-Shannon divergence instead.
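For reference, the two criteria mentioned above can be sketched as follows; the method names are illustrative, and the comment about the weights reflects the original paper's use of cluster and sample masses as mixture weights.

public class JensenShannon {
    // KL divergence; assumes q > 0 wherever p > 0 (guaranteed below, since m mixes in p).
    static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) d += p[i] * Math.log(p[i] / q[i]);
        }
        return d;
    }

    // Weighted JS divergence: pi1*KL(p||m) + pi2*KL(q||m) with m = pi1*p + pi2*q.
    // In the original paper pi1 and pi2 are roughly proportional to the masses of the
    // cluster and the single sample, so a large cluster dominates the mixture m.
    static double js(double[] p, double[] q, double pi1, double pi2) {
        double[] m = new double[p.length];
        for (int i = 0; i < p.length; i++) m[i] = pi1 * p[i] + pi2 * q[i];
        return pi1 * kl(p, m) + pi2 * kl(q, m);
    }

    // Regular (unweighted) JS divergence: the weighted form with pi1 = pi2 = 0.5.
    static double js(double[] p, double[] q) {
        return js(p, q, 0.5, 0.5);
    }
}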
References
- N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. 1999.
- N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. ACM SIGIR, 2002.
- Jaakko Peltonen, Janne Sinkkonen, and Samuel Kaski. Sequential information bottleneck for finite data. ICML, 2004.
Method Summary
static CentroidClustering<double[], SparseArray> fit(SparseArray[] data, int k, int maxIter)
    Clustering data into k clusters up to maxIter iterations.

static CentroidClustering<double[], SparseArray> fit(SparseArray[] data, Clustering.Options options)
    Clustering data into k clusters.
Method Details
fit
public static CentroidClustering<double[],SparseArray> fit(SparseArray[] data, int k, int maxIter)
Clustering data into k clusters up to maxIter iterations.
Parameters:
data - the sparse normalized co-occurrence dataset, in which each row is an observation that sums to 1.
k - the number of clusters.
maxIter - the maximum number of iterations.
Returns:
the model.
fit
public static CentroidClustering<double[],SparseArray> fit(SparseArray[] data, Clustering.Options options)
Clustering data into k clusters.
Parameters:
data - the sparse normalized co-occurrence dataset, in which each row is an observation that sums to 1.
options - the hyperparameters.
Returns:
the model.
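A hypothetical end-to-end usage sketch is shown below; the package paths and the SparseArray accessor (set) are assumptions about the surrounding library and may differ across versions.

import smile.clustering.CentroidClustering;
import smile.clustering.SIB;
import smile.util.SparseArray;

public class SIBExample {
    public static void main(String[] args) {
        // Toy co-occurrence data: each row is a sparse distribution that sums to 1.
        SparseArray[] data = new SparseArray[4];
        for (int i = 0; i < data.length; i++) {
            data[i] = new SparseArray();
            data[i].set(i, 0.5);             // assumed accessor
            data[i].set((i + 1) % 4, 0.5);   // two non-zero entries per row, summing to 1
        }

        // Cluster into 2 groups with at most 100 iterations.
        CentroidClustering<double[], SparseArray> model = SIB.fit(data, 2, 100);
        System.out.println(model);
    }
}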