sib

fun sib(data: Array<SparseArray>, k: Int, maxIter: Int = 100, runs: Int = 8): SIB

The Sequential Information Bottleneck algorithm. SIB clusters co-occurrence data such as text documents vs words. SIB is guaranteed to converge to a local maximum of the information. Moreover, the time and space complexity are significantly improved in contrast to the agglomerative IB algorithm.

In analogy to K-Means, SIB's update formulas are essentially same as the EM algorithm for estimating finite Gaussian mixture model by replacing regular Euclidean distance with Kullback-Leibler divergence, which is clearly a better dissimilarity measure for co-occurrence data. However, the common batch updating rule (assigning all instances to nearest centroids and then updating centroids) of K-Means won't work in SIB, which has to work in a sequential way (reassigning (if better) each instance then immediately update related centroids). It might be because K-L divergence is very sensitive and the centroids may be significantly changed in each iteration in batch updating rule.

Note that this implementation has a little difference from the original paper, in which a weighted Jensen-Shannon divergence is employed as a criterion to assign a randomly-picked sample to a different cluster. However, this doesn't work well in some cases as we experienced probably because the weighted JS divergence gives too much weight to clusters which is much larger than a single sample. In this implementation, we instead use the regular/unweighted Jensen-Shannon divergence.

====References:====

  • N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. 1999.

  • N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. ACM SIGIR, 2002.

  • Jaakko Peltonen, Janne Sinkkonen, and Samuel Kaski. Sequential information bottleneck for finite data. ICML, 2004.

Parameters

data

the data set.

k

the number of clusters.

maxIter

the maximum number of iterations.

runs

the number of runs of SIB algorithm.