Packages

  • package root

    Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-the-art performance.

    Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.

    Definition Classes
    root
  • package smile
    Definition Classes
    root
  • package association

    Frequent item set mining and association rule mining. Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Let I = {i₁, i₂, ..., iₙ} be a set of n binary attributes called items. Let D = {t₁, t₂, ..., tₘ} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. An association rule is defined as an implication of the form X ⇒ Y where X, Y ⊆ I and X ∩ Y = Ø. The item sets X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule, respectively. The support supp(X) of an item set X is defined as the proportion of transactions in the database that contain the item set. Note that the support of an association rule X ⇒ Y is supp(X ∪ Y). The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.

    For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy burger. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements.
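
    The support and confidence definitions can be sketched in a few lines of plain Scala. This is an illustration only, not Smile's API: the object name, the function names and the four-transaction toy database are all assumptions made for this example.

```scala
object AssociationMeasures {
  type ItemSet = Set[String]

  /** Fraction of transactions that contain every item of `items`. */
  def support(db: Seq[ItemSet], items: ItemSet): Double =
    db.count(t => items.subsetOf(t)).toDouble / db.size

  /** conf(X => Y) = supp(X ∪ Y) / supp(X); assumes supp(X) > 0. */
  def confidence(db: Seq[ItemSet], lhs: ItemSet, rhs: ItemSet): Double =
    support(db, lhs ++ rhs) / support(db, lhs)

  /** Hypothetical toy database of four transactions (made up for illustration). */
  val exampleDb: Seq[ItemSet] = Seq(
    Set("onions", "potatoes", "burger"),
    Set("onions", "potatoes"),
    Set("onions", "potatoes", "burger", "milk"),
    Set("milk", "bread")
  )
}
```

    On this toy database, {onions, potatoes} appears in 3 of 4 transactions (support 0.75) and {onions, potatoes, burger} in 2 of 4 (support 0.5), so the confidence of {onions, potatoes} ⇒ {burger} is 0.5 / 0.75, about 0.67.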

    Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into two separate steps:

    • First, minimum support is applied to find all frequent item sets in a database (i.e. frequent item set mining).
    • Second, these frequent item sets and the minimum confidence constraint are used to form rules.

    Finding all frequent item sets in a database is difficult since it involves searching all possible item sets (item combinations). The set of possible item sets is the power set over I (the set of items) and has size 2^n - 1 (excluding the empty set, which is not a valid item set). Although the size of the power set grows exponentially in the number of items n in I, efficient search is possible using the downward-closure property of support (also called anti-monotonicity), which guarantees that all subsets of a frequent item set are also frequent and thus that all supersets of an infrequent item set are infrequent.

    In practice, we may only be interested in the frequent item sets that contain the largest number of items, bypassing all of their sub item sets. An item set is maximal frequent if none of its immediate supersets is frequent.

    For a maximal frequent item set, even though we know that all of its sub item sets are frequent, we do not know their actual support, which is needed to derive the association rules within the item sets. If the final goal is association rule mining, we would like to discover closed frequent item sets. An item set is closed if none of its immediate supersets has the same support as the item set.

    Some well-known algorithms for frequent item set mining are Apriori, Eclat and FP-growth. Apriori is the best-known algorithm for mining association rules. It uses a breadth-first search strategy to count the support of item sets and a candidate generation function that exploits the downward-closure property of support. Eclat is a depth-first search algorithm using set intersection.
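
    The breadth-first, candidate-generating search described for Apriori can be sketched as follows. This is a minimal, unoptimized illustration in plain Scala, not Smile's implementation; the object and function names are assumptions, and support is expressed as an absolute count (`minCount`) for simplicity.

```scala
object AprioriSketch {
  type ItemSet = Set[String]

  /** Returns every item set contained in at least `minCount` transactions, with its count. */
  def frequentItemSets(db: Seq[ItemSet], minCount: Int): Map[ItemSet, Int] = {
    def count(c: ItemSet): Int = db.count(t => c.subsetOf(t))

    // Level 1: frequent single items.
    var level: Map[ItemSet, Int] = db.flatten.distinct
      .map(item => Set(item))
      .map(s => s -> count(s))
      .filter(_._2 >= minCount)
      .toMap
    var result = level
    var k = 1
    while (level.nonEmpty) {
      k += 1
      val prev = level.keySet
      val candidates = for {
        a <- prev
        b <- prev
        c = a ++ b
        // Candidate generation: join two frequent (k-1)-sets into one k-set.
        if c.size == k
        // Downward closure: discard c if any (k-1)-subset is infrequent.
        if c.subsets(k - 1).forall(s => prev.contains(s))
      } yield c
      // Count the surviving candidates against the database.
      level = candidates.map(c => c -> count(c)).filter(_._2 >= minCount).toMap
      result ++= level
    }
    result
  }
}
```

    The pruning step is where downward closure pays off: a candidate is counted against the database only if all of its immediate subsets were already found frequent at the previous level.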

    FP-growth (frequent pattern growth) uses an extended prefix-tree (FP-tree) structure to store the database in a compressed form. FP-growth adopts a divide-and-conquer approach to decompose both the mining tasks and the databases. It uses a pattern fragment growth method to avoid the costly process of candidate generation and testing used by Apriori.

    References:
    • R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules Between Sets of Items in Large Databases, SIGMOD, 207-216, 1993.
    • Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. VLDB, 487-499, 1994.
    • Mohammed J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, 2000.
    • Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent patterns without candidate generation. Data Mining and Knowledge Discovery 8:53-87, 2004.
    Definition Classes
    smile
  • package cas

    Computer algebra system. A computer algebra system (CAS) has the ability to manipulate mathematical expressions in a way similar to the traditional manual computations of mathematicians and scientists.

    The symbolic manipulations supported include:

    • simplification to a smaller expression or some standard form, including automatic simplification with assumptions and simplification with constraints
    • substitution of symbols or numeric values for certain expressions
    • change of form of expressions: expanding products and powers, partial and full factorization, rewriting as partial fractions, constraint satisfaction, rewriting trigonometric functions as exponentials, transforming logic expressions, etc.
    • partial and total differentiation
    • matrix operations including products, inverses, etc.
    Definition Classes
    smile
  • package classification

    Classification algorithms. In machine learning and pattern recognition, classification refers to an algorithmic procedure for assigning a given input object to one of a given number of categories. The input object is formally termed an instance, and the categories are termed classes.

    The instance is usually described by a vector of features, which together constitute a description of all known characteristics of the instance. Typically, features are either categorical (also known as nominal, i.e. consisting of one of a set of unordered items, such as a gender of "male" or "female", or a blood type of "A", "B", "AB" or "O"), ordinal (consisting of one of a set of ordered items, e.g. "large", "medium" or "small"), integer-valued (e.g. a count of the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure).

    Classification normally refers to a supervised procedure, i.e. a procedure that produces an inferred function to predict the output value of new instances based on a training set of pairs consisting of an input object and a desired output value. The inferred function is called a classifier if the output is discrete or a regression function if the output is continuous.

    The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.

    A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems. Widely used learning algorithms include AdaBoost and gradient boosting, support vector machines, linear regression, linear discriminant analysis, logistic regression, naive Bayes, decision trees, the k-nearest neighbor algorithm, and neural networks (multilayer perceptron).

    If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms cannot be easily applied. Many algorithms, including linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [-1,1] interval). Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees (and boosting algorithms based on decision trees) is that they easily handle heterogeneous data.

    If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.

    If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, linear support vector machines, naive Bayes) generally perform well. However, if there are complex interactions among features, then algorithms such as nonlinear support vector machines, decision trees and neural networks work better. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.

    There are several major issues to consider in supervised learning:

    • Features: The accuracy of the inferred function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality, but the feature vector should contain enough information to accurately predict the output. There are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. More generally, dimensionality reduction may seek to map the input data into a lower dimensional space prior to running the supervised learning algorithm.
    • Overfitting: Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. The potential for overfitting depends not only on the number of parameters and data but also the conformability of the model structure with the data shape, and the magnitude of model error compared to the expected level of noise or error in the data. In order to avoid overfitting, it is necessary to use additional techniques (e.g. cross-validation, regularization, early stopping, pruning, Bayesian priors on parameters or model comparison), that can indicate when further training is not resulting in better generalization. The basis of some techniques is either (1) to explicitly penalize overly complex models, or (2) to test the model's ability to generalize by evaluating its performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter.
    • Regularization: Regularization involves introducing additional information in order to solve an ill-posed problem or to prevent overfitting. This information usually takes the form of a penalty for complexity, such as restrictions for smoothness or bounds on the vector space norm. A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
    • Bias-variance tradeoff: Mean squared error (MSE) can be broken down into two components, variance and squared bias; this is known as the bias-variance decomposition. To minimize the MSE, we would like to reduce both the bias and the variance. In practice, however, reducing one tends to increase the other: a more flexible model usually has lower bias but higher variance. Hence there is a tradeoff between bias and variance.
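
    The decomposition named in the last item can be written out explicitly. As a sketch, assume an estimator \hat{\theta} of a fixed quantity \theta; these symbols do not appear in the text above and are introduced only for illustration.

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2,
\qquad
\mathrm{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta .
```

    Both terms are non-negative, so a small MSE requires both to be small at the same time.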
    Definition Classes
    smile
  • package clustering

    Clustering analysis. Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.

    Hierarchical algorithms find successive clusters using previously established clusters. These algorithms usually are either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

    Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in hierarchical clustering. Many partitional clustering algorithms require the number of clusters to be specified before the algorithm runs. When the proper value is not known beforehand, it must be estimated, which is a problem in its own right for which a number of techniques have been developed.

    Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold.

    Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods can thus ignore irrelevant attributes. The general problem is also known as correlation clustering, while the special case of axis-parallel subspaces is also known as two-way clustering, co-clustering or biclustering in bioinformatics. In these methods, not only the objects but also the features of the objects are clustered, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. However, they usually do not work with arbitrary feature combinations as general subspace methods do.

    Definition Classes
    smile
  • package data

    Data manipulation functions.

    Definition Classes
    smile
  • package feature

    Feature generation, normalization and selection.

    Feature generation (or constructive induction) studies methods that modify or enhance the representation of data objects. Feature generation techniques search for new features that describe the objects better than the attributes supplied with the training instances.

    Many machine learning methods, such as neural networks and SVMs with a Gaussian kernel, also require the features to be properly scaled or standardized. For example, each variable is scaled into the interval [0, 1] or standardized to have mean 0 and standard deviation 1. Although some methods, such as decision trees, can handle nominal variables directly, other methods generally require nominal variables to be converted to multiple binary dummy variables that indicate the presence or absence of a characteristic.
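
    The three transformations mentioned above can be sketched in plain Scala. This is an illustration under stated assumptions rather than Smile's API: the object and function names are made up, the inputs are assumed to contain at least two distinct values, and the sample standard deviation (dividing by n - 1) is one of several conventions.

```scala
object FeatureScaling {
  /** Rescales values linearly into [0, 1]; assumes max > min. */
  def minMax(x: Array[Double]): Array[Double] = {
    val (lo, hi) = (x.min, x.max)
    x.map(v => (v - lo) / (hi - lo))
  }

  /** Standardizes to mean 0 and sample standard deviation 1. */
  def standardize(x: Array[Double]): Array[Double] = {
    val mean = x.sum / x.length
    val sd = math.sqrt(x.map(v => (v - mean) * (v - mean)).sum / (x.length - 1))
    x.map(v => (v - mean) / sd)
  }

  /** Encodes a nominal variable as one binary dummy column per distinct level.
    * Returns the sorted levels and one row of 0/1 indicators per observation. */
  def oneHot(x: Array[String]): (Array[String], Array[Array[Int]]) = {
    val levels = x.distinct.sorted
    (levels, x.map(v => levels.map(l => if (l == v) 1 else 0)))
  }
}
```

    In practice the scaling parameters (minimum, maximum, mean, standard deviation) would be estimated on the training data and then reused unchanged on new data.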

    Feature selection is the technique of selecting a subset of relevant features for building robust learning models. By removing most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models by alleviating the effect of the curse of dimensionality, enhancing generalization capability, speeding up the learning process, etc. More importantly, feature selection also helps researchers gain a better understanding of the data.

    Feature selection algorithms typically fall into two categories: feature ranking and subset selection. Feature ranking ranks the features by a metric and eliminates all features that do not achieve an adequate score. Subset selection searches the set of possible features for the optimal subset. Clearly, an exhaustive search for the optimal subset is impractical if a large number of features is available. Commonly, heuristic methods such as genetic algorithms are employed for subset selection.

    Definition Classes
    smile
  • package imputation

    Missing value imputation. In statistics, missing data, or missing values, occur when no data value is stored for the variable in the current observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

    Data are missing for many reasons. Missing data can occur because of nonresponse: no information is provided for several items or no information is provided for a whole unit. Some items are more prone to nonresponse than others, for example items about private subjects such as income.

    Dropout is a type of missingness that occurs mostly when studying development over time. In this type of study the measurement is repeated after a certain period of time. Missingness occurs when participants drop out before the test ends and one or more measurements are missing.

    Sometimes missing values are caused by device failure or even by the researchers themselves. It is important to ask why the data are missing, because this can help with finding a solution to the problem. If the values are missing at random, there is still information about each variable in each unit, but if the values are missing systematically, the problem is more severe because the sample cannot be representative of the population.

    All of the causes for missing data fit into four classes, which are based on the relationship between the missing data mechanism and the missing and observed values. These classes are important to understand because the problems caused by missing data and the solutions to these problems are different for the four classes.

    The first is Missing Completely at Random (MCAR). MCAR means that the missing data mechanism is unrelated to the values of any variables, whether missing or observed. Data that are missing because a researcher dropped the test tubes or survey participants accidentally skipped questions are likely to be MCAR. If the observed values are essentially a random sample of the full data set, complete case analysis gives the same results as the full data set would have. Unfortunately, most missing data are not MCAR.

    At the opposite end of the spectrum is Non-Ignorable (NI). NI means that the missing data mechanism is related to the missing values. It commonly occurs when people do not want to reveal something very personal or unpopular about themselves. For example, if individuals with higher incomes are less likely to reveal them on a survey than are individuals with lower incomes, the missing data mechanism for income is non-ignorable. Whether income is missing or observed is related to its value. Complete case analysis can give highly biased results for NI missing data. If proportionally more low and moderate income individuals are left in the sample because high income people are missing, an estimate of the mean income will be lower than the actual population mean.

    In between these two extremes are Missing at Random (MAR) and Covariate Dependent (CD). Both of these classes require that the cause of the missing data is unrelated to the missing values, but may be related to the observed values of other variables. MAR means that the missing values are related to either observed covariates or response variables, whereas CD means that the missing values are related only to covariates. As an example of CD missing data, missing income data may be unrelated to the actual income values, but are related to education. Perhaps people with more education are less likely to reveal their income than those with less education.

    A key distinction is whether the mechanism is ignorable (i.e., MCAR, CD, or MAR) or non-ignorable. There are excellent techniques for handling ignorable missing data. Non-ignorable missing data are more challenging and require a different approach.

    If it is known that the data analysis technique to be used is not robust to missing values, it is good to consider imputing the missing data. Once all missing values have been imputed, the dataset can then be analyzed using standard techniques for complete data. Ideally, however, the analysis should take into account that there is a greater degree of uncertainty than if the imputed values had actually been observed, which generally requires some modification of the standard complete-data analysis methods. Many imputation techniques are available.
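
    As one very simple instance of an imputation technique, the sketch below replaces each missing value by the mean of the observed values in the same column. The choice of column-mean imputation, the use of NaN to mark missing entries, and all names are assumptions made for illustration; this is not Smile's API.

```scala
object MeanImputation {
  /** Replaces each NaN with the mean of the observed values in the same column.
    * Assumes a non-empty, rectangular data matrix. */
  def impute(data: Array[Array[Double]]): Array[Array[Double]] = {
    val p = data.head.length
    val means = Array.tabulate(p) { j =>
      val observed = data.map(row => row(j)).filterNot(_.isNaN)
      // A column with no observed value cannot be imputed this way; it stays NaN.
      if (observed.isEmpty) Double.NaN else observed.sum / observed.length
    }
    data.map(row => row.zipWithIndex.map { case (v, j) => if (v.isNaN) means(j) else v })
  }
}
```

    Mean imputation leaves the column means unchanged but shrinks the variance, so it does nothing to reflect the extra uncertainty discussed above.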

    Imputation is not the only method available for handling missing data. The expectation-maximization algorithm is a method for finding maximum likelihood estimates that has been widely applied to missing data problems. In machine learning, it is sometimes possible to train a classifier directly over the original data without imputing it first. That was shown to yield better performance in cases where the missing data is structurally absent, rather than missing due to measurement noise.

    Definition Classes
    smile
  • package manifold

    Manifold learning finds a low-dimensional basis for describing high-dimensional data. Manifold learning is a popular approach to nonlinear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high; though each data point consists of perhaps thousands of features, it may be described as a function of only a few underlying parameters. That is, the data points are actually samples from a low-dimensional manifold that is embedded in a high-dimensional space. Manifold learning algorithms attempt to uncover these parameters in order to find a low-dimensional representation of the data.

    Some prominent approaches are locally linear embedding (LLE), Hessian LLE, Laplacian eigenmaps, and LTSA. These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for Kernel PCA. More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space), while maximizing the distances between points that are not nearest neighbors.

    An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include classical multidimensional scaling (which is equivalent to PCA when the dissimilarities are Euclidean distances), Isomap (which uses geodesic distances in the data space), diffusion maps (which use diffusion distances in the data space), t-SNE (which minimizes the divergence between distributions over pairs of points), and curvilinear component analysis.

    Definition Classes
    smile
  • package math

    Mathematical and statistical functions.

    Definition Classes
    smile
  • package mds

    Multidimensional scaling. MDS is a set of related statistical techniques often used in information visualization for exploring similarities or dissimilarities in data. An MDS algorithm starts with a matrix of item-item similarities, then assigns a location to each item in N-dimensional space. For sufficiently small N, the resulting locations may be displayed in a graph or 3D visualization.

    The major types of MDS algorithms include:

    • Classical multidimensional scaling takes an input matrix giving dissimilarities between pairs of items and outputs a coordinate matrix whose configuration minimizes a loss function called strain.
    • Metric multidimensional scaling is a superset of classical MDS that generalizes the optimization procedure to a variety of loss functions and to input matrices of known distances with weights. A useful loss function in this context is called stress, which is often minimized using a procedure called stress majorization.
    • Non-metric multidimensional scaling finds both a non-parametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distances between items, and the location of each item in the low-dimensional space. The relationship is typically found using isotonic regression.
    • Generalized multidimensional scaling is an extension of metric multidimensional scaling, in which the target space is an arbitrary smooth non-Euclidean space. When the dissimilarities are distances on a surface and the target space is another surface, GMDS allows finding the minimum-distortion embedding of one surface into another.
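
    For reference, the strain and stress losses mentioned above are commonly written as follows. The notation is an assumption introduced here: D^{(2)} is the n-by-n matrix of squared dissimilarities d_{ij}^2, x_1, ..., x_n are the output coordinates, and w_{ij} are non-negative weights. Other normalizations of stress are also in use.

```latex
\begin{aligned}
B &= -\tfrac{1}{2}\, J D^{(2)} J,
  \qquad J = I - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top}, \\
\mathrm{Strain}(x_1,\dots,x_n)
  &= \left( \frac{\sum_{i,j} \big(b_{ij} - x_i^{\top} x_j\big)^2}{\sum_{i,j} b_{ij}^2} \right)^{1/2}, \\
\mathrm{Stress}(x_1,\dots,x_n)
  &= \left( \sum_{i<j} w_{ij}\,\big(d_{ij} - \lVert x_i - x_j \rVert\big)^2 \right)^{1/2}.
\end{aligned}
```

    Strain compares inner products (through the double-centered matrix B), which is why classical MDS reduces to an eigendecomposition, whereas stress compares distances directly and is minimized iteratively.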

    Definition Classes
    smile
  • package nlp

    Natural language processing.

    Definition Classes
    smile
  • package plot

    Data visualization.

    Definition Classes
    smile
  • package projection

    Feature extraction. Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.

    The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. In practice, the covariance matrix (or sometimes the correlation matrix) of the data is constructed and the eigenvectors of this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system. The original space has been reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors.

    Compared to the regular batch PCA algorithm, the generalized Hebbian algorithm (GHA) is an adaptive method for finding the k eigenvectors of the covariance matrix with the largest eigenvalues, assuming that the associated eigenvalues are distinct. GHA works with an arbitrarily large sample size and the storage requirement is modest. Another attractive feature is that, in a nonstationary environment, it has an inherent ability to track gradual changes in the optimal solution in an inexpensive way.
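
    The adaptive update behind the generalized Hebbian algorithm is usually stated as Sanger's rule. The notation below is an assumption introduced for illustration: W is the k-by-d weight matrix whose rows converge to the leading eigenvectors, x is one zero-mean input sample, and \eta is a small learning rate.

```latex
y = W x,
\qquad
\Delta w_{ij} = \eta \, y_i \Big( x_j - \sum_{k=1}^{i} w_{kj}\, y_k \Big),
\quad i = 1,\dots,k,\; j = 1,\dots,d .
```

    Each sample triggers one such update, which is why the storage requirement is limited to W itself rather than the full covariance matrix.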

    Random projection is a promising linear dimensionality reduction technique for learning mixtures of Gaussians. The key idea of random projection arises from the Johnson-Lindenstrauss lemma: if points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved.
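
    Stated a little more precisely, and with notation that is an assumption introduced here: for any 0 < \varepsilon < 1 and any set S of n points in \mathbb{R}^d, there is a linear map f : \mathbb{R}^d \to \mathbb{R}^k with k = O(\varepsilon^{-2} \log n) such that

```latex
(1 - \varepsilon)\,\lVert u - v \rVert^2
  \;\le\; \lVert f(u) - f(v) \rVert^2
  \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2
\qquad \text{for all } u, v \in S .
```

    Notably, the required target dimension k depends on the number of points and the tolerated distortion, not on the original dimension d.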

    Principal component analysis can be employed in a nonlinear way by means of the kernel trick. The resulting technique, called Kernel PCA, is capable of constructing nonlinear mappings that maximize the variance in the data. Other prominent nonlinear techniques include manifold learning techniques such as locally linear embedding (LLE), Hessian LLE, Laplacian eigenmaps, and LTSA. These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for Kernel PCA. More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space), while maximizing the distances between points that are not nearest neighbors.

    An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include classical multidimensional scaling (which is equivalent to PCA when the dissimilarities are Euclidean distances), Isomap (which uses geodesic distances in the data space), diffusion maps (which use diffusion distances in the data space), t-SNE (which minimizes the divergence between distributions over pairs of points), and curvilinear component analysis.

    A different approach to nonlinear dimensionality reduction is through the use of autoencoders, a special kind of feed-forward neural network with a bottleneck hidden layer. The training of deep autoencoders is typically performed using greedy layer-wise pre-training (e.g., using a stack of restricted Boltzmann machines) followed by a fine-tuning stage based on backpropagation.

    Definition Classes
    smile
  • package regression

    Regression analysis. Regression analysis includes any techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables. Therefore, the estimation target is a function of the independent variables called the regression function. Regression analysis is widely used for prediction and forecasting.

    Definition Classes
    smile
  • package sequence

    Sequence labeling algorithms.

    Definition Classes
    smile
  • package util

    Utility functions.

    Definition Classes
    smile
  • package validation

    Model validation.

    Definition Classes
    smile
  • package wavelet

    A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. Like the fast Fourier transform (FFT), the discrete wavelet transform (DWT) is a fast, linear operation that operates on a data vector whose length is an integer power of 2, transforming it into a numerically different vector of the same length. The wavelet transform is invertible and in fact orthogonal. Both FFT and DWT can be viewed as a rotation in function space.
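
    The properties listed above (power-of-2 length, same-length output, invertibility, orthogonality) can be checked on the simplest wavelet, the Haar wavelet. The text does not name a particular wavelet, so the choice of Haar and all names below are assumptions made for illustration; this is plain Scala, not Smile's API.

```scala
object HaarWavelet {
  private val s = math.sqrt(2.0)

  private def isPowerOfTwo(n: Int): Boolean = n > 0 && (n & (n - 1)) == 0

  /** Orthonormal Haar DWT; returns a new array of the same length. */
  def forward(x: Array[Double]): Array[Double] = {
    require(isPowerOfTwo(x.length), "length must be a power of 2")
    val out = x.clone()
    var n = x.length
    while (n > 1) {
      val half = n / 2
      val tmp = new Array[Double](n)
      for (i <- 0 until half) {
        tmp(i) = (out(2 * i) + out(2 * i + 1)) / s        // smooth (scaled sum) coefficient
        tmp(half + i) = (out(2 * i) - out(2 * i + 1)) / s // detail (scaled difference) coefficient
      }
      Array.copy(tmp, 0, out, 0, n)
      n = half // recurse on the smooth half only
    }
    out
  }

  /** Inverse of `forward`: undoes the levels in reverse order. */
  def inverse(w: Array[Double]): Array[Double] = {
    require(isPowerOfTwo(w.length), "length must be a power of 2")
    val out = w.clone()
    var n = 2
    while (n <= w.length) {
      val half = n / 2
      val tmp = new Array[Double](n)
      for (i <- 0 until half) {
        tmp(2 * i) = (out(i) + out(half + i)) / s
        tmp(2 * i + 1) = (out(i) - out(half + i)) / s
      }
      Array.copy(tmp, 0, out, 0, n)
      n *= 2
    }
    out
  }
}
```

    Because the transform is orthogonal, the sum of squares of the coefficients equals the sum of squares of the input, and applying `inverse` to the output of `forward` recovers the input up to rounding error.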

    Definition Classes
    smile
package clustering

Clustering analysis. Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.

Hierarchical algorithms find successive clusters using previously established clusters. These algorithms usually are either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in hierarchical clustering. Many partitional clustering algorithms require the number of clusters to be specified before the algorithm runs. When the proper value is not known beforehand, it must be determined, a problem in its own right for which a number of techniques have been developed.

Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold.

Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods can thus ignore irrelevant attributes. The general problem is also known as correlation clustering, while the special case of axis-parallel subspaces is also known as two-way clustering, co-clustering, or biclustering in bioinformatics. In these methods, not only the objects but also the features of the objects are clustered, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. However, they usually do not work with arbitrary feature combinations as general subspace methods do.

Value Members

  1. def clarans[T <: AnyRef](data: Array[T], distance: Distance[T], k: Int, maxNeighbor: Int, numLocal: Int = 16): CLARANS[T]

    Clustering Large Applications based upon RANdomized Search. CLARANS is an efficient medoid-based clustering algorithm. The k-medoids algorithm is an adaptation of the k-means algorithm. Rather than calculating the mean of the items in each cluster, a representative item, or medoid, is chosen for each cluster at each iteration. In CLARANS, the process of finding k medoids from n objects is viewed abstractly as searching through a certain graph. In the graph, a node is represented by a set of k objects selected as medoids. Two nodes are neighbors if their sets differ by only one object. In each iteration, CLARANS considers a set of randomly chosen neighbor nodes as candidates for new medoids. The search moves to a neighbor node if that neighbor is a better choice of medoids. Otherwise, a local optimum has been found. The entire process is repeated multiple times to find better solutions.

    CLARANS has two parameters: the maximum number of neighbors examined (maxNeighbor) and the number of local minima obtained (numLocal). The higher the value of maxNeighbor, the closer CLARANS is to PAM, and the longer each search for a local minimum takes. However, the quality of such a local minimum is higher, and fewer local minima need to be obtained.
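The graph search described above can be sketched in plain Scala as follows. This is an illustrative sketch, not Smile's `CLARANS` class: the object `ClaransSketch`, its method names, the function-typed distance `d`, and the `seed` argument are all assumptions made for the example. A node is a vector of k medoid indices, a neighbor replaces exactly one medoid with a non-medoid, and a search stops after `maxNeighbor` consecutive neighbors fail to improve the cost.

```scala
import scala.util.Random

object ClaransSketch {
  /** Total distance from every point to its closest medoid. */
  def cost[T](data: IndexedSeq[T], medoids: Seq[Int], d: (T, T) => Double): Double =
    data.map(x => medoids.map(m => d(x, data(m))).min).sum

  /** One randomized local search; returns (medoid indices, cost) of a local minimum. */
  def localSearch[T](data: IndexedSeq[T], k: Int, maxNeighbor: Int,
                     d: (T, T) => Double, rng: Random): (Vector[Int], Double) = {
    require(k > 0 && k < data.length, "need 0 < k < n")
    var current = rng.shuffle(data.indices.toVector).take(k)
    var currentCost = cost(data, current, d)
    var examined = 0
    while (examined < maxNeighbor) {
      // A neighbor node differs from the current node in exactly one medoid.
      val nonMedoids = data.indices.filter(i => !current.contains(i))
      val candidate = current.updated(rng.nextInt(k), nonMedoids(rng.nextInt(nonMedoids.length)))
      val candidateCost = cost(data, candidate, d)
      if (candidateCost < currentCost) {
        current = candidate
        currentCost = candidateCost
        examined = 0 // moved to a better node: restart the neighbor count
      } else examined += 1
    }
    (current, currentCost)
  }

  /** Repeats the local search numLocal times and keeps the best local minimum. */
  def clarans[T](data: IndexedSeq[T], k: Int, maxNeighbor: Int, numLocal: Int,
                 d: (T, T) => Double, seed: Long): (Vector[Int], Double) = {
    val rng = new Random(seed)
    (1 to numLocal).map(_ => localSearch(data, k, maxNeighbor, d, rng)).minBy(_._2)
  }
}
```

Recomputing the full cost for every candidate keeps the sketch short; a practical implementation would update the cost incrementally after a swap.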

    References:
    • R. Ng and J. Han. CLARANS: A Method for Clustering Objects for Spatial Data Mining. IEEE TRANS. KNOWLEDGE AND DATA ENGINEERING, 2002.
    data

    the data set.

    distance

    the distance/dissimilarity measure.

    k

    the number of clusters.

    maxNeighbor

    the maximum number of neighbors examined during a random search of local minima.

    numLocal

    the number of local minima to search for.

  2. def dac(data: Array[Array[Double]], k: Int, alpha: Double = 0.9, maxIter: Int = 100, tol: Double = 1E-4, splitTol: Double = 1E-2): DeterministicAnnealing

    Deterministic annealing clustering. Deterministic annealing extends soft clustering to an annealing process. For each temperature value, the algorithm iterates between the calculation of all posterior probabilities and the update of the centroid vectors until convergence is reached. The annealing starts at a high temperature, where all centroid vectors converge to the center of the pattern distribution (independent of their initial positions). Below a critical temperature the vectors start to split. Further decreasing the temperature leads to more splits until all centroid vectors are separate. The annealing can therefore avoid convergence to local minima, provided it is sufficiently slow.
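The two alternating steps at a fixed temperature can be sketched as below. This is a simplified illustration rather than Smile's `DeterministicAnnealing`: the object `AnnealingSketch` is hypothetical, squared Euclidean distance is assumed as the distortion, and equal cluster priors are assumed, so the posterior is proportional to exp(-||x - c||² / T).

```scala
object AnnealingSketch {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  /** Posterior probability of each centroid for x at temperature t (equal priors assumed). */
  def posteriors(x: Array[Double], centroids: Array[Array[Double]], t: Double): Array[Double] = {
    val d = centroids.map(c => sqDist(x, c))
    val dMin = d.min // shift by the minimum for numerical stability
    val w = d.map(v => math.exp(-(v - dMin) / t))
    val z = w.sum
    w.map(_ / z)
  }

  /** One centroid update: each centroid becomes the posterior-weighted mean of the data. */
  def updateCentroids(data: Array[Array[Double]], centroids: Array[Array[Double]],
                      t: Double): Array[Array[Double]] = {
    val p = data.map(x => posteriors(x, centroids, t))
    val dim = data.head.length
    centroids.indices.map { j =>
      val total = data.indices.map(i => p(i)(j)).sum
      Array.tabulate(dim)(c => data.indices.map(i => p(i)(j) * data(i)(c)).sum / total)
    }.toArray
  }
}
```

At a very high temperature the posteriors are nearly uniform, so every centroid moves to the overall mean of the data; at a very low temperature they approach a hard nearest-centroid assignment.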

    References:
    • Kenneth Rose. Deterministic Annealing for Clustering, Compression, Classification, Regression, and Speech Recognition.
    data

    the data set.

    k

    the maximum number of clusters.

    alpha

    the temperature T is decreasing as T = T * alpha. alpha has to be in (0, 1).

    tol

    the tolerance of convergence test.

    splitTol

    the tolerance to split a cluster.

  3. def dbscan(data: Array[Array[Double]], minPts: Int, radius: Double): DBSCAN[Array[Double]]

    DBSCAN with Euclidean distance. DBSCAN finds a number of clusters starting from the estimated density distribution of corresponding nodes.

    data

    the data set.

    minPts

    the minimum number of neighbors for a core data point.

    radius

    the neighborhood radius.

  4. def dbscan[T <: AnyRef](data: Array[T], distance: Distance[T], minPts: Int, radius: Double): DBSCAN[T]

    Density-Based Spatial Clustering of Applications with Noise. DBSCAN finds a number of clusters starting from the estimated density distribution of corresponding nodes.

    data

    the data set.

    distance

    the distance metric.

    minPts

    the minimum number of neighbors for a core data point.

    radius

    the neighborhood radius.

  5. def dbscan[T <: AnyRef](data: Array[T], nns: RNNSearch[T, T], minPts: Int, radius: Double): DBSCAN[T]

    Density-Based Spatial Clustering of Applications with Noise. DBSCAN finds a number of clusters starting from the estimated density distribution of corresponding nodes.

    DBSCAN requires two parameters: radius (i.e. neighborhood radius) and the minimum number of points required to form a cluster (minPts). It starts with an arbitrary starting point that has not been visited. This point's neighborhood is retrieved, and if it contains a sufficient number of points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized radius-environment of a different point and hence be made part of a cluster.

    If a point is found to be part of a cluster, its neighborhood is also part of that cluster. Hence, all points that are found within the neighborhood are added, as is their own neighborhood. This process continues until the cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.

    DBSCAN visits each point of the database, possibly multiple times (e.g., as candidates to different clusters). For practical considerations, however, the time complexity is mostly governed by the number of nearest neighbor queries. DBSCAN executes exactly one such query for each point, and if an indexing structure is used that executes such a neighborhood query in O(log n), an overall runtime complexity of O(n log n) is obtained.
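The procedure described above can be sketched in plain Scala as follows. This is not Smile's `DBSCAN` class: the object `DbscanSketch` is hypothetical, the neighborhood query is a brute-force O(n) scan rather than an index structure, and the sketch assumes that a point counts as its own neighbor when compared against `minPts`.

```scala
import scala.collection.mutable

object DbscanSketch {
  val Noise = -1
  private val Unvisited = -2

  /** Returns one label per point: a cluster id starting at 0, or Noise. */
  def dbscan[T](data: IndexedSeq[T], d: (T, T) => Double, minPts: Int, radius: Double): Array[Int] = {
    val label = Array.fill(data.length)(Unvisited)

    // Brute-force neighborhood query; the result includes the point itself.
    def neighbors(i: Int): IndexedSeq[Int] =
      data.indices.filter(j => d(data(i), data(j)) <= radius)

    var cluster = 0
    for (i <- data.indices) {
      if (label(i) == Unvisited) {
        val seeds = neighbors(i)
        if (seeds.length < minPts) {
          label(i) = Noise // may be relabeled later as a border point of a cluster
        } else {
          label(i) = cluster
          val queue = mutable.Queue.empty[Int]
          queue ++= seeds
          while (queue.nonEmpty) {
            val j = queue.dequeue()
            if (label(j) == Noise) {
              label(j) = cluster // border point reached from a core point
            } else if (label(j) == Unvisited) {
              label(j) = cluster
              val nj = neighbors(j)
              if (nj.length >= minPts) queue ++= nj // j is a core point: keep expanding
            }
          }
          cluster += 1
        }
      }
    }
    label
  }
}
```

Each point triggers at most one neighborhood query, so replacing the brute-force scan with an index that answers queries in O(log n) would give the O(n log n) behavior mentioned above.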

    DBSCAN has many advantages such as

    • DBSCAN does not need to know the number of clusters in the data a priori, as opposed to k-means.
    • DBSCAN can find arbitrarily shaped clusters. It can even find clusters completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
    • DBSCAN has a notion of noise.
    • DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (Only points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)

    On the other hand, DBSCAN has the disadvantages of

    • In high dimensional space, the data are sparse everywhere because of the curse of dimensionality. Therefore, DBSCAN doesn't work well on high-dimensional data in general.
    • DBSCAN does not respond well to data sets with varying densities.
    References:
    • Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, 1996.
    • Jorg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu. Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. 1998.
    data

    the data set.

    nns

    the data structure for neighborhood search.

    minPts

    the minimum number of neighbors for a core data point.

    radius

    the neighborhood radius.

  6. def denclue(data: Array[Array[Double]], sigma: Double, m: Int): DENCLUE

    DENsity CLUstering. The DENCLUE algorithm employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Data points going to the same local maximum are put into the same cluster.

    Clearly, DENCLUE doesn't work on data with a uniform distribution. In high-dimensional space, the data always look uniformly distributed because of the curse of dimensionality. Therefore, DENCLUE doesn't work well on high-dimensional data in general.
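To make the notion of a point "going to" a local maximum concrete, here is a small sketch of a Gaussian kernel density estimate together with a fixed-point hill-climbing step (a kernel-weighted mean of the data, in the spirit of mean shift). It is an illustration under assumptions, not Smile's `DENCLUE`: the object `DenclueSketch`, the unnormalized kernel exp(-||x - y||² / (2 sigma²)), and the stopping parameters `maxIter` and `tol` are all chosen for the example.

```scala
object DenclueSketch {
  private def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  private def kernel(x: Array[Double], y: Array[Double], sigma: Double): Double =
    math.exp(-sqDist(x, y) / (2 * sigma * sigma))

  /** Kernel density estimate at x, up to a constant normalization factor. */
  def density(x: Array[Double], data: Array[Array[Double]], sigma: Double): Double =
    data.map(y => kernel(x, y, sigma)).sum / data.length

  /** Climbs from `start` toward a local maximum of the density estimate. */
  def climb(start: Array[Double], data: Array[Array[Double]], sigma: Double,
            maxIter: Int = 200, tol: Double = 1e-9): Array[Double] = {
    var current = start
    var moved = Double.MaxValue
    var iter = 0
    while (moved > tol && iter < maxIter) {
      val weights = data.map(y => kernel(current, y, sigma))
      val total = weights.sum
      // Next position: kernel-weighted mean of all data points.
      val next = Array.tabulate(current.length) { c =>
        data.indices.map(i => weights(i) * data(i)(c)).sum / total
      }
      moved = math.sqrt(sqDist(current, next))
      current = next
      iter += 1
    }
    current
  }
}
```

Points whose climbs end at (approximately) the same location would be placed in the same cluster.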

    References:
    • A. Hinneburg and D. A. Keim. A general approach to clustering in large databases with noise. Knowledge and Information Systems, 5(4):387-415, 2003.
    • Alexander Hinneburg and Hans-Henning Gabriel. DENCLUE 2.0: Fast Clustering based on Kernel Density Estimation. IDA, 2007.
    data

    the data set.

    sigma

    the smoothing parameter in the Gaussian kernel. The user can choose sigma such that the number of density attractors is constant over a long interval of sigma.

    m

    the number of selected samples used in the iteration. This number should be much smaller than the number of data points to speed up the algorithm, but large enough to capture sufficient information about the underlying distribution.

  7. def gmeans(data: Array[Array[Double]], k: Int = 100): GMeans

    G-Means clustering algorithm, an extended K-Means which tries to automatically determine the number of clusters by normality test. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian.

    References:
    • G. Hamerly and C. Elkan. Learning the k in k-means. NIPS, 2003.
    data

    the data set.

    k

    the maximum number of clusters.

  8. def hclust[T <: AnyRef](data: Array[T], distance: Distance[T], method: String): HierarchicalClustering

    Agglomerative Hierarchical Clustering. This method seeks to build a hierarchy of clusters in a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The results of hierarchical clustering are usually presented in a dendrogram.

    In general, the merges are determined in a greedy manner. In order to decide which clusters should be combined, a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

    Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.
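The greedy merging can be sketched directly from a distance matrix, as noted above. This is a naive O(n³) illustration, not Smile's `HierarchicalClustering`: the object `HclustSketch`, the `Linkage` type, and the two example linkages are assumptions for the example, covering only the "single" and "complete" criteria.

```scala
object HclustSketch {
  /** A linkage maps the pairwise distances between two clusters to one dissimilarity. */
  type Linkage = Seq[Double] => Double
  val single: Linkage = _.min   // closest pair of observations
  val complete: Linkage = _.max // farthest pair of observations

  /** Returns the merges in order as (left members, right members, merge height). */
  def agglomerate(dist: Array[Array[Double]], linkage: Linkage): List[(Set[Int], Set[Int], Double)] = {
    var clusters: Vector[Set[Int]] = dist.indices.map(i => Set(i)).toVector
    var merges = List.empty[(Set[Int], Set[Int], Double)]
    while (clusters.length > 1) {
      // Evaluate the linkage for every pair of current clusters.
      val candidates = for {
        i <- clusters.indices
        j <- (i + 1) until clusters.length
      } yield {
        val pairwise = for (a <- clusters(i).toSeq; b <- clusters(j).toSeq) yield dist(a)(b)
        (i, j, linkage(pairwise))
      }
      // Greedy step: merge the pair with the smallest dissimilarity.
      val (i, j, height) = candidates.minBy(_._3)
      merges = (clusters(i), clusters(j), height) :: merges
      val merged = clusters(i) ++ clusters(j)
      clusters = clusters.zipWithIndex.collect { case (c, idx) if idx != i && idx != j => c } :+ merged
    }
    merges.reverse
  }
}
```

The list of merge heights is the information a dendrogram displays; only the distance matrix is consulted, never the observations themselves.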

    References

    • David Eppstein. Fast hierarchical clustering and other applications of dynamic closest pairs. SODA 1998.
    data

    The data set.

    distance

    the distance/dissimilarity measure.

    method

    the agglomeration method to merge clusters. This should be one of "single", "complete", "upgma", "upgmc", "wpgma", "wpgmc", and "ward".

  9. def hclust(data: Array[Array[Double]], method: String): HierarchicalClustering

    Agglomerative Hierarchical Clustering. This method seeks to build a hierarchy of clusters in a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The results of hierarchical clustering are usually presented in a dendrogram.

    In general, the merges are determined in a greedy manner. In order to decide which clusters should be combined, a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

    Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.

    References

    • David Eppstein. Fast hierarchical clustering and other applications of dynamic closest pairs. SODA 1998.
    data

    The data set.

    method

    the agglomeration method to merge clusters. This should be one of "single", "complete", "upgma", "upgmc", "wpgma", "wpgmc", and "ward".

  10. def kmeans(data: Array[Array[Double]], k: Int, maxIter: Int = 100, tol: Double = 1E-4, runs: Int = 16): KMeans

    K-Means clustering. The algorithm partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Although finding an exact solution to the k-means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.

    However, the k-means algorithm has at least two major theoretic shortcomings:

    • First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.
    • Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.

    In this implementation, we use k-means++ which addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive to the optimal k-means solution.

    We also use k-d trees to speed up each k-means step as described in the filter algorithm by Kanungo, et al.

    K-means is a hard clustering method, i.e. each sample is assigned to a specific cluster. In contrast, soft clustering, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assigns samples to different clusters with different probabilities.
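The k-means++ seeding and the standard Lloyd iterations described above can be sketched as follows. This is a plain-Scala illustration, not Smile's `KMeans`: the object `KMeansSketch` and its methods are hypothetical, the k-d tree acceleration is omitted, and the stopping rule (no change in assignments) is an assumption that differs from the `tol` parameter documented here.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object KMeansSketch {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  /** k-means++ seeding: each new center is drawn with probability proportional
    * to the squared distance from the closest center chosen so far. */
  def seed(data: Array[Array[Double]], k: Int, rng: Random): Array[Array[Double]] = {
    val centers = ArrayBuffer(data(rng.nextInt(data.length)))
    while (centers.length < k) {
      val d2 = data.map(x => centers.map(c => sqDist(x, c)).min)
      var r = rng.nextDouble() * d2.sum
      var i = 0
      while (i < data.length - 1 && r >= d2(i)) { r -= d2(i); i += 1 }
      centers += data(i)
    }
    centers.toArray
  }

  /** Lloyd's algorithm: alternate assignment and mean update until labels stop changing. */
  def lloyd(data: Array[Array[Double]], init: Array[Array[Double]],
            maxIter: Int = 100): (Array[Array[Double]], Array[Int]) = {
    var centers = init.map(_.clone())
    var labels = Array.fill(data.length)(-1)
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val next = data.map(x => centers.indices.minBy(j => sqDist(x, centers(j))))
      changed = !next.sameElements(labels)
      labels = next
      centers = centers.indices.map { j =>
        val members = data.indices.filter(i => labels(i) == j).map(i => data(i))
        if (members.isEmpty) centers(j) // keep the old center for an empty cluster
        else Array.tabulate(data.head.length)(c => members.map(m => m(c)).sum / members.length)
      }.toArray
      iter += 1
    }
    (centers, labels)
  }
}
```

A typical call chains the two steps, for example `KMeansSketch.lloyd(data, KMeansSketch.seed(data, 2, new Random(1)))`; repeating this with different seeds and keeping the result with the smallest distortion mirrors the role of the `runs` parameter.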

    References:
    • Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE TRANS. PAMI, 2002.
    • D. Arthur and S. Vassilvitskii. "K-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.
    • Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. A systematic evaluation of different methods for initializing the K-means clustering algorithm. 2010.

    This method runs the algorithm the given number of times and returns the best result, i.e. the one with the smallest distortion.

    data

    the data set.

    k

    the number of clusters.

    maxIter

    the maximum number of iterations for each run.

    tol

    the tolerance of convergence test.

    runs

    the number of runs of K-Means algorithm.

  11. def kmodes(data: Array[Array[Int]], k: Int, maxIter: Int = 100, runs: Int = 10): KModes

    K-Modes clustering. K-Modes is the binary equivalent of K-Means. The mean update for centroids is replaced by the mode update, which is a majority vote among the elements of each cluster.
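The mode update can be sketched as a per-attribute majority vote. This is an illustration only: the object `KModesSketch` is hypothetical, the tie-breaking rule (smallest value wins) is an arbitrary choice, and the use of Hamming distance for assigning points to modes is an assumption that this page does not confirm.

```scala
object KModesSketch {
  /** Hamming distance: the number of attributes on which two items differ. */
  def hamming(a: Array[Int], b: Array[Int]): Int =
    a.zip(b).count { case (x, y) => x != y }

  /** Mode of a cluster: for each attribute, the most frequent value among the members. */
  def mode(members: Seq[Array[Int]]): Array[Int] =
    Array.tabulate(members.head.length) { c =>
      members.map(m => m(c))
        .groupBy(identity)
        .map { case (value, occurrences) => (value, occurrences.size) }
        .toSeq
        .sortBy { case (value, count) => (-count, value) } // ties go to the smallest value
        .head._1
    }
}
```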

  12. def mec[T <: AnyRef](data: Array[T], nns: RNNSearch[T, T], k: Int, radius: Double, y: Array[Int], tol: Double = 1E-4): MEC[T]

    Nonparametric Minimum Conditional Entropy Clustering.

    data

    the data set.

    nns

    the data structure for neighborhood search.

    k

    the number of clusters. Note that this is just a hint. The final number of clusters may be less.

    radius

    the neighborhood radius.

    tol

    the tolerance of convergence test.

  13. def mec(data: Array[Array[Double]], k: Int, radius: Double): MEC[Array[Double]]

    Nonparametric Minimum Conditional Entropy Clustering. Assume Euclidean distance.

    data

    the data set.

    k

    the number of clusters. Note that this is just a hint. The final number of clusters may be less.

    radius

    the neighborhood radius.

  14. def mec[T <: AnyRef](data: Array[T], distance: Metric[T], k: Int, radius: Double): MEC[T]

    Nonparametric Minimum Conditional Entropy Clustering.

    data

    the data set.

    distance

    the distance measure for neighborhood search.

    k

    the number of clusters. Note that this is just a hint. The final number of clusters may be less.

    radius

    the neighborhood radius.

  15. def mec[T <: AnyRef](data: Array[T], distance: Distance[T], k: Int, radius: Double): MEC[T]

    Nonparametric Minimum Conditional Entropy Clustering. This method performs very well especially when the exact number of clusters is unknown. The method can also correctly reveal the structure of data and effectively identify outliers simultaneously.

    The clustering criterion is based on the conditional entropy H(C | x), where C is the cluster label and x is an observation. According to Fano's inequality, we can estimate C with a low probability of error only if the conditional entropy H(C | X) is small. MEC also generalizes the criterion by replacing Shannon's entropy with Havrda-Charvat's structural α-entropy. Interestingly, the minimum entropy criterion based on structural α-entropy is equal to the probability of error of the nearest neighbor method when α = 2. To estimate p(C | x), MEC employs Parzen density estimation, a nonparametric approach.

    MEC is an iterative algorithm that starts with an initial partition given by any other clustering method, e.g. k-means, CLARANS, hierarchical clustering, etc. Note that a random initialization is NOT appropriate.

    References:
    • Haifeng Li, Keshu Zhang, and Tao Jiang. Minimum Entropy Clustering and Applications to Gene Expression Analysis. CSB, 2004.
    data

    the data set.

    distance

    the distance measure for neighborhood search.

    k

    the number of clusters. Note that this is just a hint. The final number of clusters may be less.

    radius

    the neighborhood radius.

  16. def sib(data: Array[SparseArray], k: Int, maxIter: Int = 100, runs: Int = 8): SIB

    The Sequential Information Bottleneck algorithm. SIB clusters co-occurrence data such as text documents vs words. SIB is guaranteed to converge to a local maximum of the information. Moreover, the time and space complexity are significantly improved in contrast to the agglomerative IB algorithm.

    In analogy to K-Means, SIB's update formulas are essentially the same as the EM algorithm for estimating a finite Gaussian mixture model, with regular Euclidean distance replaced by Kullback-Leibler divergence, which is clearly a better dissimilarity measure for co-occurrence data. However, the common batch updating rule of K-Means (assigning all instances to the nearest centroids and then updating the centroids) won't work in SIB, which has to work sequentially (reassigning each instance if that is better, then immediately updating the related centroids). This may be because K-L divergence is very sensitive, so the centroids may change significantly in each iteration under the batch updating rule.

    Note that this implementation differs slightly from the original paper, in which a weighted Jensen-Shannon divergence is employed as the criterion to assign a randomly picked sample to a different cluster. In our experience, however, this doesn't work well in some cases, probably because the weighted JS divergence gives too much weight to clusters, which are much larger than a single sample. In this implementation, we instead use the regular/unweighted Jensen-Shannon divergence.
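For reference, the two divergences mentioned above can be sketched for discrete distributions as follows. This is a generic illustration, not code from Smile: the object `DivergenceSketch` is hypothetical, logarithms are taken base 2, and the weight parameter `w1` (with `w2 = 1 - w1`) is an assumed way of expressing the weighted variant; `w1 = 0.5` gives the regular, unweighted Jensen-Shannon divergence.

```scala
object DivergenceSketch {
  private val log2 = math.log(2.0)

  /** Kullback-Leibler divergence KL(p || q) in bits, with the convention 0 * log 0 = 0. */
  def kl(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).collect { case (pi, qi) if pi > 0 => pi * math.log(pi / qi) / log2 }.sum

  /** Jensen-Shannon divergence with weights w1 and 1 - w1 (unweighted when w1 = 0.5). */
  def js(p: Array[Double], q: Array[Double], w1: Double = 0.5): Double = {
    val w2 = 1.0 - w1
    // The mixture m is positive wherever p or q is, so both KL terms stay finite.
    val m = p.zip(q).map { case (pi, qi) => w1 * pi + w2 * qi }
    w1 * kl(p, m) + w2 * kl(q, m)
  }
}
```

Unlike K-L divergence, the unweighted Jensen-Shannon divergence is symmetric and bounded (by 1 bit with base-2 logarithms), which makes it less sensitive when comparing a single sample against a cluster.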

    References:
    • N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. 1999.
    • N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. ACM SIGIR, 2002.
    • Jaakko Peltonen, Janne Sinkkonen, and Samuel Kaski. Sequential information bottleneck for finite data. ICML, 2004.
    data

    the data set.

    k

    the number of clusters.

    maxIter

    the maximum number of iterations.

    runs

    the number of runs of SIB algorithm.

  17. def specc(data: Array[Array[Double]], k: Int, l: Int, sigma: Double): SpectralClustering

    Spectral clustering with Nystrom approximation.

    data

    the dataset for clustering.

    k

    the number of clusters.

    l

    the number of random samples for Nystrom approximation.

    sigma

    the smooth/width parameter of the Gaussian kernel, which is a somewhat sensitive parameter. To search for the best setting, one may pick the value that gives the tightest clusters (smallest distortion, see distortion()) in feature space.

  18. def specc(data: Array[Array[Double]], k: Int, sigma: Double): SpectralClustering

    Spectral clustering.

    data

    the dataset for clustering.

    k

    the number of clusters.

    sigma

    the smooth/width parameter of the Gaussian kernel, which is a somewhat sensitive parameter. To search for the best setting, one may pick the value that gives the tightest clusters (smallest distortion, see distortion()) in feature space.

  19. def specc(W: Matrix, k: Int): SpectralClustering

    Spectral Clustering. Given a set of data points, the similarity matrix may be defined as a matrix S where Sij represents a measure of the similarity between points i and j. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions. The clustering is then performed in the dimension-reduced space, in which clusters of non-convex shape may become tight. There are some intriguing similarities between spectral clustering methods and kernel PCA, which has been empirically observed to perform clustering.
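The first step, building the similarity matrix, can be sketched as below, together with the symmetric normalization used in the Ng, Jordan, and Weiss paper cited in the references. This is only a partial illustration, not Smile's `SpectralClustering`: the object `SpectralSketch` is hypothetical, the kernel form exp(-||xi - xj||² / (2 sigma²)) with a zero diagonal is an assumption, and the eigendecomposition and the final k-means step are omitted.

```scala
object SpectralSketch {
  /** Gaussian similarity matrix with a zero diagonal. */
  def affinity(data: Array[Array[Double]], sigma: Double): Array[Array[Double]] =
    Array.tabulate(data.length, data.length) { (i, j) =>
      if (i == j) 0.0
      else {
        val d2 = data(i).zip(data(j)).map { case (a, b) => (a - b) * (a - b) }.sum
        math.exp(-d2 / (2 * sigma * sigma))
      }
    }

  /** Symmetric normalization D^(-1/2) W D^(-1/2), where D holds the row sums of W. */
  def normalize(w: Array[Array[Double]]): Array[Array[Double]] = {
    val degree = w.map(_.sum)
    Array.tabulate(w.length, w.length)((i, j) => w(i)(j) / math.sqrt(degree(i) * degree(j)))
  }
}
```

For well-separated groups the normalized matrix is close to block diagonal, which is why its leading eigenvectors expose the cluster structure.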

    References:
    • A.Y. Ng, M.I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an algorithm. NIPS, 2001.
    • Marina Meila and Jianbo Shi. Learning segmentation by random walks. NIPS, 2000.
    • Deepak Verma and Marina Meila. A Comparison of Spectral Clustering Algorithms. 2003.
    W

    the adjacency matrix of graph.

    k

    the number of clusters.

  20. def xmeans(data: Array[Array[Double]], k: Int = 100): XMeans

    X-Means clustering algorithm, an extended K-Means which tries to automatically determine the number of clusters based on BIC scores. Starting with only one cluster, the X-Means algorithm goes into action after each run of K-Means, making local decisions about which subset of the current centroids should split themselves in order to better fit the data. The splitting decision is done by computing the Bayesian Information Criterion (BIC).

    References:
    • Dan Pelleg and Andrew Moore. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. ICML, 2000.
    data

    the data set.

    k

    the maximum number of clusters.

  21. object $dummy

    Hacking scaladoc issue-8124. The user should ignore this object.
