Package smile.clustering

Clustering analysis.

Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.

Hierarchical algorithms find successive clusters using previously established clusters. These algorithms usually are either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering. Many partitional clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem on its own for which a number of techniques have been developed.

Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold.

Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods can thus ignore irrelevant attributes. The general problem is also known as correlation clustering, while the special case of axis-parallel subspaces is known as two-way clustering, co-clustering, or biclustering in bioinformatics: in these methods not only the objects but also their features are clustered, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. Unlike general subspace methods, however, they usually do not work with arbitrary feature combinations.

Functions

clarans
fun <T> clarans(data: Array<T>, distance: Distance<T>, k: Int, maxNeighbor: Int, numLocal: Int = 16): CLARANS<T>

Clustering Large Applications based upon RANdomized Search. CLARANS is an efficient medoid-based clustering algorithm. The k-medoids algorithm is an adaptation of the k-means algorithm: rather than calculating the mean of the items in each cluster, a representative item, or medoid, is chosen for each cluster at each iteration. In CLARANS, the process of finding k medoids from n objects is viewed abstractly as searching through a certain graph. In the graph, a node is represented by a set of k objects selected as medoids. Two nodes are neighbors if their sets differ by only one object. In each iteration, CLARANS considers a set of randomly chosen neighbor nodes as candidates for new medoids. The search moves to a neighbor node if it is a better choice of medoids; otherwise, a local optimum has been found. The entire process is repeated multiple times to find better local optima.
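
For illustration, a minimal Kotlin sketch of the generic call above; the observations are assumed to be plain double vectors, EuclideanDistance is taken from smile.math.distance, and the parameter values are purely illustrative.

import smile.clustering.clarans
import smile.math.distance.EuclideanDistance

// search for 4 medoids, examining up to 20 random neighbor nodes per step
fun claransExample(data: Array<DoubleArray>) =
    clarans(data, EuclideanDistance(), k = 4, maxNeighbor = 20)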

dac
fun dac(data: Array<DoubleArray>, k: Int, alpha: Double = 0.9, maxIter: Int = 100, tol: Double = 1E-4, splitTol: Double = 1E-2): DeterministicAnnealing

Deterministic annealing clustering. Deterministic annealing extends soft clustering to an annealing process. For each temperature value, the algorithm iterates between the calculation of all posterior probabilities and the update of the centroid vectors until convergence is reached. The annealing starts at a high temperature, where all centroid vectors converge to the center of the pattern distribution (independent of their initial positions). Below a critical temperature the vectors start to split, and further decreasing the temperature leads to more splits until all centroid vectors are separate. If the annealing is sufficiently slow, it can therefore avoid convergence to poor local minima.
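
A minimal sketch of a call with illustrative parameter values; k is the maximum number of clusters and alpha the cooling rate.

import smile.clustering.dac

// k caps the number of clusters; alpha (0 < alpha < 1) is the annealing (cooling) rate
fun dacExample(data: Array<DoubleArray>) =
    dac(data, k = 12, alpha = 0.9, maxIter = 200)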

dbscan
fun dbscan(data: Array<DoubleArray>, minPts: Int, radius: Double): DBSCAN<DoubleArray>

DBSCAN with Euclidean distance. DBSCAN finds a number of clusters starting from the estimated density distribution of corresponding nodes.

fun <T> dbscan(data: Array<T>, distance: Distance<T>, minPts: Int, radius: Double): DBSCAN<T>
fun <T> dbscan(data: Array<T>, nns: RNNSearch<T, T>, minPts: Int, radius: Double): DBSCAN<T>

Density-Based Spatial Clustering of Applications with Noise. DBSCAN finds a number of clusters starting from the estimated density distribution of corresponding nodes.
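
A sketch of the Euclidean overload; the minPts and radius values that define the density threshold are purely illustrative.

import smile.clustering.dbscan

// a point is a core point if at least 5 observations lie within radius 0.5 of it;
// points not density-reachable from any core point are reported as noise
fun dbscanExample(data: Array<DoubleArray>) =
    dbscan(data, minPts = 5, radius = 0.5)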

denclue
fun denclue(data: Array<DoubleArray>, sigma: Double, m: Int): DENCLUE

DENsity CLUstering. The DENCLUE algorithm employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Data points going to the same local maximum are put into the same cluster.
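
A sketch with illustrative values; the stated roles of sigma and m follow the usual DENCLUE parameterization and should be checked against the DENCLUE class documentation.

import smile.clustering.denclue

// sigma: bandwidth of the Gaussian kernel density estimate;
// m: number of selected samples used in the iterations (values illustrative)
fun denclueExample(data: Array<DoubleArray>) =
    denclue(data, sigma = 1.0, m = 50)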

gmeans
fun gmeans(data: Array<DoubleArray>, k: Int = 100): GMeans

G-Means clustering algorithm, an extended K-Means which tries to automatically determine the number of clusters by a normality test. The G-means algorithm is based on a statistical test of the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian.
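
A sketch; k = 50 is only an illustrative upper bound on the number of clusters.

import smile.clustering.gmeans

// k is an upper bound; splitting stops once the data assigned to every
// center passes the Gaussian normality test
fun gmeansExample(data: Array<DoubleArray>) = gmeans(data, k = 50)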

hclust
fun hclust(data: Array<DoubleArray>, method: String): HierarchicalClustering
fun <T> hclust(data: Array<T>, distance: Distance<T>, method: String): HierarchicalClustering

Agglomerative Hierarchical Clustering. This method seeks to build a hierarchy of clusters in a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The results of hierarchical clustering are usually presented in a dendrogram.
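
A sketch assuming "complete" is an accepted linkage name and that HierarchicalClustering.partition(k) cuts the dendrogram into k flat clusters.

import smile.clustering.hclust

fun hclustExample(data: Array<DoubleArray>): IntArray {
    // complete-linkage agglomeration on Euclidean distances
    val tree = hclust(data, "complete")
    // cut the dendrogram into a flat partition of 4 clusters
    return tree.partition(4)
}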

kmeans
fun kmeans(data: Array<DoubleArray>, k: Int, maxIter: Int = 100, tol: Double = 1E-4, runs: Int = 16): KMeans

K-Means clustering. The algorithm partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Although finding an exact solution to the k-means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.
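
A sketch with an illustrative k; the y field used below is assumed to hold the per-observation cluster labels.

import smile.clustering.kmeans

fun kmeansExample(data: Array<DoubleArray>) {
    // 5 clusters; the best of the 16 default restarts is kept
    val model = kmeans(data, k = 5)
    // the cluster label of each observation is assumed to be exposed as model.y
    println(model.y.take(10))
}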

kmodes
fun kmodes(data: Array<IntArray>, k: Int, maxIter: Int = 100, runs: Int = 10): KModes

K-Modes clustering. K-Modes is the counterpart of K-Means for categorical (e.g., binary) data. The mean update for centroids is replaced by the mode, i.e., a majority vote among the elements of each cluster.
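
A sketch; the integer encoding of the categorical attributes is assumed to be prepared elsewhere.

import smile.clustering.kmodes

// codes: categorical attributes encoded as non-negative integers, one row per observation
fun kmodesExample(codes: Array<IntArray>) = kmodes(codes, k = 3)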

mec
fun mec(data: Array<DoubleArray>, k: Int, radius: Double): MEC<DoubleArray>

Nonparametric Minimum Conditional Entropy Clustering. Assumes Euclidean distance.

fun <T> mec(data: Array<T>, distance: Distance<T>, k: Int, radius: Double): MEC<T>

Nonparametric Minimum Conditional Entropy Clustering. This method performs very well especially when the exact number of clusters is unknown. The method can also correctly reveal the structure of data and effectively identify outliers simultaneously.

fun <T> mec(data: Array<T>, distance: Metric<T>, k: Int, radius: Double): MEC<T>
fun <T> mec(data: Array<T>, nns: RNNSearch<T, T>, k: Int, radius: Double, y: IntArray, tol: Double = 1E-4): MEC<T>

Nonparametric Minimum Conditional Entropy Clustering.
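
A sketch of the Euclidean overload with illustrative values.

import smile.clustering.mec

// k is a hint for the number of clusters (the final number may be smaller);
// radius is the neighborhood size of the range search (values illustrative)
fun mecExample(data: Array<DoubleArray>) = mec(data, k = 20, radius = 1.0)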

sib
fun sib(data: Array<SparseArray>, k: Int, maxIter: Int = 100, runs: Int = 8): SIB

The Sequential Information Bottleneck algorithm. SIB clusters co-occurrence data such as text documents vs words. SIB is guaranteed to converge to a local maximum of the information. Moreover, the time and space complexity are significantly improved in contrast to the agglomerative IB algorithm.
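
A sketch assuming SparseArray comes from smile.util and that each document has already been turned into a sparse term-frequency vector.

import smile.clustering.sib
import smile.util.SparseArray

// docs: one sparse term-frequency vector per document
fun sibExample(docs: Array<SparseArray>) = sib(docs, k = 10)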

specc
fun specc(W: Matrix, k: Int): SpectralClustering

Spectral Clustering. Given a set of data points, the similarity matrix may be defined as a matrix S in which Sij represents a measure of the similarity between points i and j. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The clustering is then performed in the dimension-reduced space, in which clusters of non-convex shape may become tight. There are intriguing similarities between spectral clustering methods and kernel PCA, which has also been empirically observed to perform clustering.

fun specc(data: Array<DoubleArray>, k: Int, sigma: Double): SpectralClustering

Spectral clustering.

fun specc(data: Array<DoubleArray>, k: Int, l: Int, sigma: Double): SpectralClustering

Spectral clustering with Nystrom approximation.
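
A sketch of the Gaussian-kernel and Nystrom overloads with illustrative values.

import smile.clustering.specc

fun speccExample(data: Array<DoubleArray>) {
    // Gaussian kernel similarity with bandwidth sigma, 3 clusters
    val full = specc(data, k = 3, sigma = 0.7)
    // Nystrom approximation with l = 100 random samples for larger data sets
    val approx = specc(data, k = 3, l = 100, sigma = 0.7)
}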

xmeans
fun xmeans(data: Array<DoubleArray>, k: Int = 100): XMeans

X-Means clustering algorithm, an extended K-Means which tries to automatically determine the number of clusters based on BIC scores. Starting with only one cluster, the X-Means algorithm goes into action after each run of K-Means, making local decisions about which subset of the current centroids should split themselves in order to better fit the data. The splitting decision is done by computing the Bayesian Information Criterion (BIC).
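
A sketch; k = 50 is an illustrative upper bound.

import smile.clustering.xmeans

// k is the maximum number of clusters; the BIC-driven splitting may stop earlier
fun xmeansExample(data: Array<DoubleArray>) = xmeans(data, k = 50)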