Smile - Statistical Machine Intelligence and Learning Engine

Packages

package root
Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-the-art performance.
Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.
Definition Classes
root
package smile
Definition Classes
root
package association
Frequent item set mining and association rule mining. Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Let I = {i₁, i₂,..., i_n} be a set of n binary attributes called items. Let D = {t₁, t₂,..., t_m} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. An association rule is defined as an implication of the form X ⇒ Y where X, Y ⊆ I and X ∩ Y = Ø. The item sets X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule, respectively. The support supp(X) of an item set X is defined as the proportion of transactions in the database that contain the item set. Note that the support of an association rule X ⇒ Y is supp(X ∪ Y). The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that a customer who buys onions and potatoes together is also likely to buy a burger. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements.
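The support and confidence definitions above can be sketched in a few lines of plain Scala. This is an illustrative sketch only, not part of the Smile API: the object `AssociationMeasures`, its methods, and the toy database `db` are all hypothetical names and data chosen for the example.

```scala
object AssociationMeasures {
  type Transaction = Set[String]

  /** supp(X): fraction of transactions that contain every item in `items`. */
  def support(db: Seq[Transaction], items: Set[String]): Double =
    db.count(t => items.subsetOf(t)).toDouble / db.size

  /** conf(X => Y) = supp(X ∪ Y) / supp(X). Assumes supp(X) > 0. */
  def confidence(db: Seq[Transaction], lhs: Set[String], rhs: Set[String]): Double =
    support(db, lhs ++ rhs) / support(db, lhs)

  // Hypothetical toy database of four supermarket transactions.
  val db: Seq[Transaction] = Seq(
    Set("onions", "potatoes", "burger"),
    Set("onions", "potatoes"),
    Set("onions", "potatoes", "burger", "milk"),
    Set("milk", "bread")
  )
}
```

In this toy database, {onions, potatoes} appears in three of four transactions and {onions, potatoes, burger} in two, so the rule {onions, potatoes} ⇒ {burger} has support 0.5 and confidence 2/3.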
Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split up into two separate steps:
- First, minimum support is applied to find all frequent item sets in a database (i.e. frequent item set mining).
- Second, these frequent item sets and the minimum confidence constraint are used to form rules.
Finding all frequent item sets in a database is difficult since it involves searching all possible item sets (item combinations). The set of possible item sets is the power set over I (the set of items) and has size 2ⁿ - 1 (excluding the empty set, which is not a valid item set). Although the size of the power set grows exponentially in the number of items n in I, efficient search is possible using the downward-closure property of support (also called anti-monotonicity), which guarantees that all subsets of a frequent item set are frequent and, conversely, that all supersets of an infrequent item set are infrequent.
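As a rough illustration of how downward closure prunes the search, the following sketch performs a level-wise (breadth-first) search in plain Scala. It is a simplified teaching example under stated assumptions, not Smile's implementation: `LevelWiseSearch` and `frequentItemSets` are hypothetical names, and support is recomputed by scanning the whole database for every candidate.

```scala
object LevelWiseSearch {
  type ItemSet = Set[String]

  def support(db: Seq[ItemSet], items: ItemSet): Double =
    db.count(t => items.subsetOf(t)).toDouble / db.size

  /** Returns every item set whose support is at least `minSupport`. */
  def frequentItemSets(db: Seq[ItemSet], minSupport: Double): Set[ItemSet] = {
    // Level 1: frequent single items.
    val items: Seq[String] = db.flatten.distinct
    var level: Set[ItemSet] =
      items.map(i => Set(i)).filter(s => support(db, s) >= minSupport).toSet
    var result = Set.empty[ItemSet]
    while (level.nonEmpty) {
      result ++= level
      val k = level.head.size + 1
      // Join frequent (k-1)-item sets into candidates of size k.
      val candidates = for {
        a <- level
        b <- level
        c = a ++ b
        if c.size == k
      } yield c
      // Downward closure: drop a candidate if any (k-1)-subset is infrequent,
      // so its support never has to be counted.
      val pruned = candidates.filter(c => c.forall(i => level.contains(c - i)))
      level = pruned.filter(c => support(db, c) >= minSupport)
    }
    result
  }
}
```

The pruning step is where anti-monotonicity pays off: a candidate is only counted against the database if all of its immediate subsets were already found to be frequent.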
In practice, we may consider only the maximal frequent item sets, bypassing all of their sub item sets. An item set is maximal frequent if none of its immediate supersets is frequent.
For a maximal frequent item set, even though we know that all of its sub item sets are frequent, we don't know their actual support, which is very important for finding the association rules within the item sets. If the final goal is association rule mining, we would like to discover closed frequent item sets. An item set is closed if none of its immediate supersets has the same support as the item set.
Some well-known algorithms for frequent item set mining are Apriori, Eclat and FP-growth. Apriori is the best-known algorithm to mine association rules. It uses a breadth-first search strategy to count the support of item sets and uses a candidate generation function which exploits the downward-closure property of support. Eclat is a depth-first search algorithm using set intersection.
FP-growth (frequent pattern growth) uses an extended prefix-tree (FP-tree) structure to store the database in a compressed form. FP-growth adopts a divide-and-conquer approach to decompose both the mining tasks and the databases. It uses a pattern fragment growth method to avoid the costly process of candidate generation and testing used by Apriori.
References:
- R. Agrawal, T. Imielinski and A. Swami. Mining Association Rules Between Sets of Items in Large Databases, SIGMOD, 207-216, 1993.
- Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. VLDB, 487-499, 1994.
- Mohammed J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, 2000.
- Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent patterns without candidate generation. Data Mining and Knowledge Discovery 8:53-87, 2004.
Definition Classes
smile
package cas
Computer algebra system. A computer algebra system (CAS) has the ability to manipulate mathematical expressions in a way similar to the traditional manual computations of mathematicians and scientists.
The symbolic manipulations supported include:
- simplification to a smaller expression or some standard form, including automatic simplification with assumptions and simplification with constraints
- substitution of symbols or numeric values for certain expressions
- change of form of expressions: expanding products and powers, partial and full factorization, rewriting as partial fractions, constraint satisfaction, rewriting trigonometric functions as exponentials, transforming logic expressions, etc.
- partial and total differentiation
- matrix operations including products, inverses, etc.
Definition Classes
smile
package classification
Classification algorithms. In machine learning and pattern recognition, classification refers to an algorithmic procedure for assigning a given input object into one of a given number of categories. The input object is formally termed an instance, and the categories are termed classes.
The instance is usually described by a vector of features, which together constitute a description of all known characteristics of the instance. Typically, features are either categorical (also known as nominal, i.e. consisting of one of a set of unordered items, such as a gender of "male" or "female", or a blood type of "A", "B", "AB" or "O"), ordinal (consisting of one of a set of ordered items, e.g. "large", "medium" or "small"), integer-valued (e.g. a count of the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure).
Classification normally refers to a supervised procedure, i.e. a procedure that produces an inferred function to predict the output value of new instances based on a training set of pairs consisting of an input object and a desired output value. The inferred function is called a classifier if the output is discrete or a regression function if the output is continuous.
The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.
A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems. The most widely used learning algorithms are AdaBoost and gradient boosting, support vector machines, linear regression, linear discriminant analysis, logistic regression, naive Bayes, decision trees, k-nearest neighbor algorithm, and neural networks (multilayer perceptron).
If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms cannot be easily applied. Many algorithms, including linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [-1,1] interval). Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees (and boosting algorithms based on decision trees) is that they easily handle heterogeneous data.
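A minimal sketch of the rescaling mentioned above, mapping every numerical feature linearly onto [-1, 1]. The object and method names (`FeatureScaling`, `scaleToUnitRange`) are hypothetical and not part of Smile; mapping constant columns to 0 is an assumption made here to avoid division by zero.

```scala
object FeatureScaling {
  /** Linearly rescales each column of `x` to [-1, 1]; constant columns map to 0. */
  def scaleToUnitRange(x: Array[Array[Double]]): Array[Array[Double]] = {
    val p = x.head.length
    val mins = Array.tabulate(p)(j => x.map(_(j)).min)
    val maxs = Array.tabulate(p)(j => x.map(_(j)).max)
    x.map { row =>
      Array.tabulate(p) { j =>
        val range = maxs(j) - mins(j)
        if (range == 0.0) 0.0 else 2.0 * (row(j) - mins(j)) / range - 1.0
      }
    }
  }
}
```

In practice the minimum and maximum would be computed on the training set only and then reused for new instances, so that training and prediction see the same transformation.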
If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.
If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, linear support vector machines, naive Bayes) generally perform well. However, if there are complex interactions among features, then algorithms such as nonlinear support vector machines, decision trees and neural networks work better. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.
There are several major issues to consider in supervised learning:
- Features: The accuracy of the inferred function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output. There are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. More generally, dimensionality reduction may seek to map the input data into a lower dimensional space prior to running the supervised learning algorithm.
- Overfitting: Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. The potential for overfitting depends not only on the number of parameters and data but also the conformability of the model structure with the data shape, and the magnitude of model error compared to the expected level of noise or error in the data. In order to avoid overfitting, it is necessary to use additional techniques (e.g. cross-validation, regularization, early stopping, pruning, Bayesian priors on parameters or model comparison), that can indicate when further training is not resulting in better generalization. The basis of some techniques is either (1) to explicitly penalize overly complex models, or (2) to test the model's ability to generalize by evaluating its performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter.
- Regularization: Regularization involves introducing additional information in order to solve an ill-posed problem or to prevent over-fitting. This information is usually of the form of a penalty for complexity, such as restrictions for smoothness or bounds on the vector space norm. A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
- Bias-variance tradeoff: Mean squared error (MSE) can be broken down into two components: variance and squared bias, known as the bias-variance decomposition. Thus in order to minimize the MSE, we need to minimize both the bias and the variance. However, this is not trivial, because reducing one typically increases the other: a more flexible model tends to have lower bias but higher variance. Therefore, there is a tradeoff between bias and variance.
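For reference, the bias-variance decomposition named in the last item can be written out as follows. The notation is an assumption introduced here: θ̂ denotes an estimator of a fixed quantity θ, and the expectation is taken over training sets.

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \underbrace{\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]}_{\text{variance}}
  + \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta\big)^2}_{\text{squared bias}}
```

The cross term vanishes because the deviation of θ̂ from its own mean has expectation zero.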
Definition Classes
smile
package clustering
Clustering analysis. Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.
Hierarchical algorithms find successive clusters using previously established clusters. These algorithms usually are either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering. Many partitional clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem on its own for which a number of techniques have been developed.
Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold.
Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods thus can ignore irrelevant attributes. The general problem is also known as correlation clustering, while the special case of axis-parallel subspaces is also known as two-way clustering, co-clustering or biclustering in bioinformatics: in these methods not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. However, they usually do not work with arbitrary feature combinations as general subspace methods do.
Definition Classes
smile
package data
Data manipulation functions.
Definition Classes
smile
package feature
Definition Classes
smile
package manifold
Manifold learning finds a low-dimensional basis for describing high-dimensional data. Manifold learning is a popular approach to nonlinear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high; though each data point consists of perhaps thousands of features, it may be described as a function of only a few underlying parameters. That is, the data points are actually samples from a low-dimensional manifold that is embedded in a high-dimensional space. Manifold learning algorithms attempt to uncover these parameters in order to find a low-dimensional representation of the data.
Some prominent approaches are locally linear embedding (LLE), Hessian LLE, Laplacian eigenmaps, and LTSA. These techniques construct a low-dimensional data representation using a cost function that retains local properties of the data, and can be viewed as defining a graph-based kernel for Kernel PCA. More recently, techniques have been proposed that, instead of defining a fixed kernel, try to learn the kernel using semidefinite programming. The most prominent example of such a technique is maximum variance unfolding (MVU). The central idea of MVU is to exactly preserve all pairwise distances between nearest neighbors (in the inner product space), while maximizing the distances between points that are not nearest neighbors.
An alternative approach to neighborhood preservation is through the minimization of a cost function that measures differences between distances in the input and output spaces. Important examples of such techniques include classical multidimensional scaling (which is equivalent to PCA when Euclidean distances are used), Isomap (which uses geodesic distances in the data space), diffusion maps (which use diffusion distances in the data space), t-SNE (which minimizes the divergence between distributions over pairs of points), and curvilinear component analysis.
Definition Classes
smile
package math
Mathematical and statistical functions.
Definition Classes
smile
package nlp
Natural language processing.
Definition Classes
smile
package plot
Data visualization.
Definition Classes
smile
package regression
Regression analysis.
Definition Classes
smile
gpr
package sequence
Sequence labeling algorithms.
Definition Classes
smile
package util
Utility functions.
Definition Classes
smile
package validation
Model validation.
Definition Classes
smile
package wavelet
A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. Like the fast Fourier transform (FFT), the discrete wavelet transform (DWT) is a fast, linear operation that operates on a data vector whose length is an integer power of 2, transforming it into a numerically different vector of the same length. The wavelet transform is invertible and in fact orthogonal. Both FFT and DWT can be viewed as a rotation in function space.
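To make the properties above concrete, here is a minimal sketch of the orthonormal Haar transform, the simplest discrete wavelet transform. It is written in plain Scala for illustration and is not the implementation in this package; `HaarTransform`, `forward`, and `inverse` are hypothetical names.

```scala
object HaarTransform {
  private val s = math.sqrt(2.0)

  private def isPowerOfTwo(n: Int): Boolean = n > 0 && (n & (n - 1)) == 0

  /** Orthonormal Haar DWT of a vector whose length is a power of 2. */
  def forward(x: Array[Double]): Array[Double] = {
    require(isPowerOfTwo(x.length), "length must be a power of 2")
    val a = x.clone()
    var n = a.length
    while (n > 1) {
      val half = n / 2
      val tmp = new Array[Double](n)
      for (i <- 0 until half) {
        tmp(i) = (a(2 * i) + a(2 * i + 1)) / s        // smooth (average) part
        tmp(half + i) = (a(2 * i) - a(2 * i + 1)) / s // detail (difference) part
      }
      Array.copy(tmp, 0, a, 0, n)
      n = half // recurse on the smooth part only
    }
    a
  }

  /** Inverse transform: undoes `forward` level by level, coarsest first. */
  def inverse(w: Array[Double]): Array[Double] = {
    require(isPowerOfTwo(w.length), "length must be a power of 2")
    val a = w.clone()
    var n = 2
    while (n <= a.length) {
      val half = n / 2
      val tmp = new Array[Double](n)
      for (i <- 0 until half) {
        tmp(2 * i) = (a(i) + a(half + i)) / s
        tmp(2 * i + 1) = (a(i) - a(half + i)) / s
      }
      Array.copy(tmp, 0, a, 0, n)
      n *= 2
    }
    a
  }
}
```

Because each level is an orthogonal change of basis, the output has the same length and the same Euclidean norm as the input, and applying `inverse` to the result of `forward` recovers the original vector up to rounding.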
Definition Classes
smile

regression

package regression

Regression analysis. Regression analysis includes any techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables. Therefore, the estimation target is a function of the independent variables called the regression function. Regression analysis is widely used for prediction and forecasting.

Linear Supertypes

AnyRef, Any

Value Members

def cart(formula: Formula, data: DataFrame, maxDepth: Int = 20, maxNodes: Int = 0, nodeSize: Int = 5): RegressionTree
Regression tree. A classification/regression tree can be learned by splitting the training set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when all observations in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions.
The algorithms that are used for constructing decision trees usually work top-down by choosing a variable at each step that is the next best variable to use in splitting the set of items. "Best" is defined by how well the variable splits the set into homogeneous subsets that have the same value of the target variable. Different algorithms use different formulae for measuring "best". Used by the CART algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability of each item being chosen times the probability of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category. Information gain is another popular measure, used by the ID3, C4.5 and C5.0 algorithms. Information gain is based on the concept of entropy used in information theory. For categorical variables with different numbers of levels, however, information gain is biased in favor of those attributes with more levels. Instead, one may employ the information gain ratio, which addresses this drawback of information gain.
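The two impurity measures described above can be sketched as follows. This is an illustrative plain-Scala example for class labels, not the splitting code used by `cart`; the names `SplitCriteria`, `gini`, `entropy`, and `informationGain` are hypothetical, and the entropy is measured in bits by assumption.

```scala
object SplitCriteria {
  /** Gini impurity: sum over classes of p * (1 - p). */
  def gini(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { g =>
      val p = g.size / n
      p * (1.0 - p) // probability of picking the class times probability of mislabeling it
    }.sum
  }

  /** Shannon entropy in bits: -sum over classes of p * log2(p). */
  def entropy(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    val sum = labels.groupBy(identity).values.map { g =>
      val p = g.size / n
      p * (math.log(p) / math.log(2.0))
    }.sum
    -sum
  }

  /** Entropy of the parent minus the size-weighted entropy of the two children. */
  def informationGain(parent: Seq[Int], left: Seq[Int], right: Seq[Int]): Double = {
    val n = parent.size.toDouble
    entropy(parent) - (left.size / n) * entropy(left) - (right.size / n) * entropy(right)
  }
}
```

A node containing a single class has Gini impurity 0, while a balanced two-class node has Gini impurity 0.5 and entropy 1 bit; a split that separates the two classes perfectly therefore has an information gain of 1 bit.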
Classification and regression tree techniques have a number of advantages over many alternative techniques.
- Simple to understand and interpret: In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for purposes of rapid classification of new observations, but can also often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner.
- Able to handle both numerical and categorical data: Other techniques are usually specialized in analyzing datasets that have only one type of variable.
- Nonparametric and nonlinear: The final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or that they are even monotonic in nature. Thus, tree methods are particularly well suited for data mining tasks, where there is often little a priori knowledge and no coherent set of theories or predictions regarding which variables are related and how. In those types of data analytics, tree methods can often reveal simple relationships between just a few variables that could have easily gone unnoticed using other analytic techniques.
One major problem with classification and regression trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. Besides, decision-tree learners can create over-complex trees that cause over-fitting. Mechanisms such as pruning are necessary to avoid this problem. Another limitation of trees is the lack of smoothness of the prediction surface.
Some techniques such as bagging, boosting, and random forest use more than one decision tree for their analysis.
formula: a symbolic description of the model to be fitted.
data: the data frame of the explanatory and response variables.
maxDepth: the maximum depth of the tree.
maxNodes: the maximum number of leaf nodes in the tree.
nodeSize: the minimum size of leaf nodes.
returns: Regression tree model.
def gbm(formula: Formula, data: DataFrame, loss: Loss = Loss.lad(), ntrees: Int = 500, maxDepth: Int = 20, maxNodes: Int = 6, nodeSize: Int = 5, shrinkage: Double = 0.05, subsample: Double = 0.7): GradientTreeBoost
Gradient boosted regression trees.
Generic gradient boosting at the t-th step would fit a regression tree to pseudo-residuals. Let J be the number of its leaves. The tree partitions the input space into J disjoint regions and predicts a constant value in each region. The parameter J controls the maximum allowed level of interaction between variables in the model. With J = 2 (decision stumps), no interaction between variables is allowed. With J = 3 the model may include effects of the interaction between up to two variables, and so on. Hastie et al. comment that typically 4 ≤ J ≤ 8 works well for boosting and results are fairly insensitive to the choice of J in this range, J = 2 is insufficient for many applications, and J > 10 is unlikely to be required.
Fitting the training set too closely can lead to degradation of the model's generalization ability. Several so-called regularization techniques reduce this over-fitting effect by constraining the fitting procedure. One natural regularization parameter is the number of gradient boosting iterations T (i.e. the number of trees in the model when the base learner is a decision tree). Increasing T reduces the error on the training set, but setting it too high may lead to over-fitting. An optimal value of T is often selected by monitoring prediction error on a separate validation data set.
Another regularization approach is shrinkage, which multiplies the update term by a parameter η (called the "learning rate"). Empirically it has been found that using small learning rates (such as η < 0.1) yields dramatic improvements in the model's generalization ability over gradient boosting without shrinking (η = 1). However, it comes at the price of increasing computational time both during training and prediction: a lower learning rate requires more iterations.
Soon after the introduction of gradient boosting Friedman proposed a minor modification to the algorithm, motivated by Breiman's bagging method. Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement. Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.
Subsample size is some constant fraction f of the size of the training set. When f = 1, the algorithm is deterministic and identical to the one described above. Smaller values of f introduce randomness into the algorithm and help prevent over-fitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller datasets at each iteration. Typically, f is set to 0.5, meaning that one half of the training set is used to build each base learner.
Also, like in bagging, sub-sampling allows one to define an out-of-bag estimate of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.
Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees' terminal nodes. It's used in the tree building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances. Imposing this limit helps to reduce variance in predictions at leaves.
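The ingredients described above (fitting a small tree to pseudo-residuals, shrinkage, and subsampling without replacement) can be combined in a short sketch. It is deliberately simplified and is not how `gbm` is implemented: it assumes squared-error loss (so the pseudo-residuals are the ordinary residuals, whereas `gbm` defaults to least absolute deviation), a single input feature, and decision stumps (J = 2). All names in `BoostingSketch` are hypothetical.

```scala
import scala.util.Random

object BoostingSketch {
  /** A decision stump (J = 2 leaves) on a single feature. */
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(v: Seq[Double]): Double = if (v.isEmpty) 0.0 else v.sum / v.size

  /** Fits the stump that minimizes the squared error against the residuals `r`. */
  def fitStump(x: Seq[Double], r: Seq[Double]): Stump = {
    val pairs = x.zip(r)
    // Candidate thresholds that leave both leaves non-empty.
    val thresholds = x.distinct.sorted.dropRight(1)
    if (thresholds.isEmpty) Stump(x.head, mean(r), mean(r))
    else thresholds.map { t =>
      val (l, g) = pairs.partition(_._1 <= t)
      val s = Stump(t, mean(l.map(_._2)), mean(g.map(_._2)))
      val sse = pairs.map { case (xi, ri) => val e = ri - s.predict(xi); e * e }.sum
      (sse, s)
    }.minBy(_._1)._2
  }

  final case class Model(init: Double, shrinkage: Double, stumps: Seq[Stump]) {
    def predict(x: Double): Double = init + shrinkage * stumps.map(_.predict(x)).sum
  }

  def fit(x: IndexedSeq[Double], y: IndexedSeq[Double], ntrees: Int,
          shrinkage: Double, subsample: Double, seed: Long = 42L): Model = {
    val rng = new Random(seed)
    val n = x.size
    val m = math.max(1, math.round(subsample * n).toInt)
    val init = mean(y)
    val f = Array.fill(n)(init) // current prediction F(x_i) for every training point
    val stumps = Seq.newBuilder[Stump]
    for (_ <- 0 until ntrees) {
      // Stochastic gradient boosting: sample a fraction without replacement.
      val idx = rng.shuffle((0 until n).toList).take(m)
      // For squared-error loss the pseudo-residuals are y_i - F(x_i).
      val residuals = idx.map(i => y(i) - f(i))
      val stump = fitStump(idx.map(i => x(i)), residuals)
      stumps += stump
      // Shrinkage: only a fraction of each tree's prediction is added.
      for (i <- 0 until n) f(i) += shrinkage * stump.predict(x(i))
    }
    Model(init, shrinkage, stumps.result())
  }
}
```

With `subsample = 1.0` every iteration sees the full training set and the procedure is deterministic; smaller fractions trade some per-tree accuracy for randomness and speed, as described above.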
References:
- J. H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine, 1999.
- J. H. Friedman. Stochastic Gradient Boosting, 1999.
formula: a symbolic description of the model to be fitted.
data: the data frame of the explanatory and response variables.
loss: loss function for regression. By default, least absolute deviation is employed for robust regression.
ntrees: the number of iterations (trees).
maxDepth: the maximum depth of the tree.
maxNodes: the maximum number of leaf nodes in the tree.
nodeSize: the minimum size of leaf nodes.
shrinkage: the shrinkage parameter in (0, 1], which controls the learning rate of the procedure.
subsample: the sampling fraction for stochastic tree boosting.
returns: Gradient boosted trees.
def lasso(formula: Formula, data: DataFrame, lambda: Double, tol: Double = 1E-3, maxIter: Int = 5000): LinearModel
Least absolute shrinkage and selection operator. The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients (i.e. L₁-regularized). It has connections to soft-thresholding of wavelet coefficients, forward stage-wise regression, and boosting methods.
The Lasso typically yields a sparse solution, in which the parameter vector β has relatively few nonzero coefficients. In contrast, the solution of L₂-regularized least squares (i.e. ridge regression) typically has all coefficients nonzero. Because it effectively reduces the number of variables, the Lasso is useful for variable selection and for obtaining models that are easier to interpret.
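For comparison, the two regularized least squares problems can be written side by side. The notation is an assumption introduced here (X is the design matrix, y the response vector, t the bound, and λ the regularization parameter); scaling conventions such as the factor 1/2 vary between texts and implementations and only rescale λ.

```latex
\begin{aligned}
\hat{\beta}^{\text{lasso}} &= \arg\min_{\beta} \; \tfrac{1}{2}\,\|y - X\beta\|_2^2
  \quad \text{subject to} \quad \|\beta\|_1 \le t \\
  &\;\Longleftrightarrow\; \arg\min_{\beta} \; \tfrac{1}{2}\,\|y - X\beta\|_2^2 + \lambda\,\|\beta\|_1
  \quad \text{for a corresponding } \lambda \ge 0, \\
\hat{\beta}^{\text{ridge}} &= \arg\min_{\beta} \; \tfrac{1}{2}\,\|y - X\beta\|_2^2 + \lambda\,\|\beta\|_2^2 .
\end{aligned}
```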
For over-determined systems (more instances than variables, as is common in machine learning), we normalize variables to mean 0 and standard deviation 1. For under-determined systems (fewer instances than variables, e.g. compressed sensing), we assume white noise (i.e. no intercept in the linear model) and do not perform normalization. Note that the solution is not unique in this case.
There is no analytic formula or expression for the optimal solution to the L₁-regularized least squares problems. Therefore, its solution must be computed numerically. The objective function in the L₁-regularized least squares is convex but not differentiable, so solving it is more of a computational challenge than solving the L₂-regularized least squares. The Lasso may be solved using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least angle regression algorithm.
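As one example of a numerical solver, the sketch below uses cyclic coordinate descent with soft-thresholding. This is a swapped-in technique chosen for brevity: it is not one of the algorithms named above and not necessarily what `lasso` uses internally. It assumes the objective 0.5 * ||y - Xβ||² + λ||β||₁ with no intercept and no normalization; `LassoSketch` and its members are hypothetical names.

```scala
object LassoSketch {
  /** Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0). */
  def softThreshold(z: Double, t: Double): Double =
    math.signum(z) * math.max(math.abs(z) - t, 0.0)

  /** Minimizes 0.5 * ||y - X b||^2 + lambda * ||b||_1 by cyclic coordinate descent. */
  def fit(x: Array[Array[Double]], y: Array[Double], lambda: Double,
          maxIter: Int = 1000, tol: Double = 1e-8): Array[Double] = {
    val n = x.length
    val p = x.head.length
    val beta = new Array[Double](p)
    val residual = y.clone() // y - X * beta, with beta = 0 initially
    var iter = 0
    var maxChange = Double.MaxValue
    while (iter < maxIter && maxChange > tol) {
      maxChange = 0.0
      for (j <- 0 until p) {
        val norm2 = (0 until n).map(i => x(i)(j) * x(i)(j)).sum
        if (norm2 > 0.0) {
          // Inner product of column j with the partial residual that
          // excludes column j's own current contribution.
          val rho = (0 until n).map(i => x(i)(j) * (residual(i) + x(i)(j) * beta(j))).sum
          val updated = softThreshold(rho, lambda) / norm2
          val delta = updated - beta(j)
          for (i <- 0 until n) residual(i) -= x(i)(j) * delta
          beta(j) = updated
          maxChange = math.max(maxChange, math.abs(delta))
        }
      }
      iter += 1
    }
    beta
  }
}
```

The soft-thresholding step is what produces exact zeros: any coefficient whose correlation with the partial residual is at most λ in absolute value is set to zero, which is the sparsity property described earlier.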
References:
- R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., 58(1):267-288, 1996.
- B. Efron, I. Johnstone, T. Hastie, and R. Tibshirani. Least angle regression. Annals of Statistics, 2003
- Seung-Jean Kim, K. Koh, M. Lustig, Stephen Boyd, and Dimitry Gorinevsky. An Interior-Point Method for Large-Scale L1-Regularized Least Squares. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 1, NO. 4, 2007.
formula: a symbolic description of the model to be fitted.
data: the data frame of the explanatory and response variables.
lambda: the shrinkage/regularization parameter.
tol: the tolerance for stopping iterations (relative target duality gap).
maxIter: the maximum number of iterations.
def lm(formula: Formula, data: DataFrame, method: String = "qr", stderr: Boolean = true, recursive: Boolean = true): LinearModel
Fitting linear models (ordinary least squares). In linear regression, the model specification is that the dependent variable is a linear combination of the parameters (but need not be linear in the independent variables). The residual is the difference between the value of the dependent variable predicted by the model, and the true value of the dependent variable. Ordinary least squares obtains parameter estimates that minimize the sum of squared residuals, SSE (also denoted RSS).
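For the special case of a single explanatory variable, the least squares estimates have a simple closed form, sketched below in plain Scala. This is an illustration of the criterion only, not the QR or SVD based fitting performed by `lm`; `SimpleOls`, `fit`, and `rss` are hypothetical names.

```scala
object SimpleOls {
  /** Fits y = a + b * x by ordinary least squares; returns (intercept a, slope b). */
  def fit(x: Array[Double], y: Array[Double]): (Double, Double) = {
    require(x.length == y.length && x.length >= 2, "need at least two paired observations")
    val mx = x.sum / x.length
    val my = y.sum / y.length
    val sxy = x.zip(y).map { case (xi, yi) => (xi - mx) * (yi - my) }.sum
    val sxx = x.map(xi => (xi - mx) * (xi - mx)).sum
    val slope = sxy / sxx // assumes x is not constant
    (my - slope * mx, slope)
  }

  /** Sum of squared residuals (SSE / RSS) for the line y = a + b * x. */
  def rss(x: Array[Double], y: Array[Double], a: Double, b: Double): Double =
    x.zip(y).map { case (xi, yi) =>
      val e = yi - (a + b * xi)
      e * e
    }.sum
}
```

Any other choice of intercept and slope gives a sum of squared residuals at least as large as the one obtained from `fit`.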
The OLS estimator is consistent when the independent variables are exogenous and there is no multicollinearity, and optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances.
There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and the same results; the only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of the data at hand, and on the inference task which has to be performed.
Least squares corresponds to the maximum likelihood criterion if the experimental errors have a normal distribution and can also be derived as a method of moments estimator.
Once a regression model has been constructed, it may be important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters. Commonly used checks of goodness of fit include the R-squared, analysis of the pattern of residuals and hypothesis testing. Statistical significance can be checked by an F-test of the overall fit, followed by t-tests of individual parameters.
Interpretations of these diagnostic tests rest heavily on the model assumptions. Although examination of the residuals can be used to invalidate a model, the results of a t-test or F-test are sometimes more difficult to interpret if the model's assumptions are violated. For example, if the error term does not have a normal distribution, then in small samples the estimated parameters will not follow normal distributions, which complicates inference. With relatively large samples, however, a central limit theorem can be invoked so that hypothesis testing may proceed using asymptotic approximations.
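As a minimal sketch of the least-squares criterion and the R-squared check described above, the following plain Scala fits a single-predictor model in closed form. It does not call Smile; `OlsSketch` and its method names are hypothetical, and the closed-form formulas assume exactly one explanatory variable plus an intercept.

```scala
object OlsSketch {
  /** Fits y ≈ intercept + slope * x by ordinary least squares.
    * Returns (intercept, slope). Assumes a single predictor. */
  def fit(x: Array[Double], y: Array[Double]): (Double, Double) = {
    require(x.length == y.length && x.length >= 2, "need at least two paired observations")
    val xMean = x.sum / x.length
    val yMean = y.sum / y.length
    val sxy = x.zip(y).map { case (xi, yi) => (xi - xMean) * (yi - yMean) }.sum
    val sxx = x.map(xi => (xi - xMean) * (xi - xMean)).sum
    require(sxx > 0.0, "x must not be constant")
    val slope = sxy / sxx
    (yMean - slope * xMean, slope)
  }

  /** Sum of squared residuals (SSE, also denoted RSS). */
  def rss(x: Array[Double], y: Array[Double], intercept: Double, slope: Double): Double =
    x.zip(y).map { case (xi, yi) =>
      val residual = yi - (intercept + slope * xi)
      residual * residual
    }.sum

  /** R-squared = 1 - RSS / TSS, where TSS is the total sum of squares around the mean of y. */
  def rSquared(x: Array[Double], y: Array[Double], intercept: Double, slope: Double): Double = {
    val yMean = y.sum / y.length
    val tss = y.map(yi => (yi - yMean) * (yi - yMean)).sum
    require(tss > 0.0, "y must not be constant")
    1.0 - rss(x, y, intercept, slope) / tss
  }
}
```

An R-squared close to 1 means the fitted line explains most of the variance in y; it says nothing about whether the residuals satisfy the assumptions behind the t-tests and F-test.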
formula
a symbolic description of the model to be fitted.
data
the data frame of the explanatory and response variables.
method
the fitting method ("svd" or "qr").
stderr
if true, compute the standard errors of the parameter estimates.
recursive
if true, the returned model supports recursive least squares.
def randomForest(formula: Formula, data: DataFrame, ntrees: Int = 500, mtry: Int = 0, maxDepth: Int = 20, maxNodes: Int = 500, nodeSize: Int = 5, subsample: Double = 1.0): RandomForest
Random forest for regression.
Random forest is an ensemble method that consists of many decision trees. For regression, it outputs the average of the predictions of the individual trees. The method combines the bagging idea with the random selection of features.
Each tree is constructed using the following algorithm:
1. If the number of cases in the training set is N, randomly sample N cases with replacement from the original data. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible, subject to the maxDepth, maxNodes, and nodeSize limits. There is no pruning.
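The two sampling steps above can be sketched in plain Scala as follows. This is an illustration only, not Smile's implementation; `ForestSamplingSketch`, `bootstrap`, and `featureSubset` are hypothetical names, and the actual split search at each node is omitted.

```scala
import scala.util.Random

object ForestSamplingSketch {
  /** Step 1: draw n row indices with replacement from 0 until n.
    * The same row may appear several times; some rows may not appear at all. */
  def bootstrap(n: Int, rng: Random): Array[Int] = {
    require(n > 0, "training set must not be empty")
    Array.fill(n)(rng.nextInt(n))
  }

  /** Step 2: choose m distinct feature indices out of totalFeatures (M in the text).
    * A tree builder would call this once per node and search for the best split
    * among only these m features. */
  def featureSubset(totalFeatures: Int, m: Int, rng: Random): Array[Int] = {
    require(m >= 1 && m <= totalFeatures, "m must satisfy 1 <= m <= M")
    rng.shuffle((0 until totalFeatures).toList).take(m).toArray
  }
}
```

Because m stays fixed while the forest grows, the same value is passed to `featureSubset` at every node of every tree.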
The advantages of random forest are:
- For many data sets, it produces a highly accurate model.
- It runs efficiently on large data sets.
- It can handle thousands of input variables without variable deletion.
- It gives estimates of which variables are important for prediction.
- It generates an internal unbiased estimate of the generalization error as the forest building progresses.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
The disadvantages are:
- Random forests are prone to over-fitting for some datasets. This is even more pronounced on noisy data.
- For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Therefore, the variable importance scores from random forest are not reliable for this type of data.
formula
a symbolic description of the model to be fitted.
data
the data frame of the explanatory and response variables.
ntrees
the number of trees.
mtry
the number of input variables to be used to determine the decision at a node of the tree. dim/3 seems to give generally good performance, where dim is the number of variables.
maxDepth
the maximum depth of the tree.
maxNodes
the maximum number of leaf nodes in the tree.
nodeSize
the minimum size of leaf nodes.
subsample
the sampling rate for training each tree. 1.0 means sampling with replacement; a value < 1.0 means sampling without replacement.
returns
Random forest regression model.
def rbfnet(x: Array[Array[Double]], y: Array[Double], k: Int, normalized: Boolean = false): RBFNetwork[Array[Double]]
Trains a Gaussian RBF network with k-means.
def rbfnet[T <: AnyRef](x: Array[T], y: Array[Double], neurons: Array[RBF[T]], normalized: Boolean): RBFNetwork[T]
Radial basis function networks.
A radial basis function network is an artificial neural network that uses radial basis functions as activation functions. Its output is a linear combination of radial basis functions. RBF networks are used in function approximation, time series prediction, and control.
In its basic form, a radial basis function network has the form
y(x) = Σ_{i=1..N} w_i φ(||x - c_i||)
where the approximating function y(x) is represented as a sum of N radial basis functions φ, each associated with a different center c_i, and weighted by an appropriate coefficient w_i. For distance, one usually chooses Euclidean distance. The weights w_i can be estimated using the matrix methods of linear least squares, because the approximating function is linear in the weights.
The centers c_i can be randomly selected from the training data, learned by some clustering method (e.g. k-means), or learned jointly with the weight parameters through a supervised learning process (e.g. error-correction learning).
The popular choices for φ comprise the Gaussian function and the so-called thin plate splines. The advantage of the thin plate splines is that their conditioning is invariant under scalings. Gaussian, multi-quadric and inverse multi-quadric functions are infinitely smooth and involve a scale or shape parameter, r₀ > 0. Decreasing r₀ tends to flatten the basis function. For a given function, the quality of approximation may strongly depend on this parameter. In particular, increasing r₀ has the effect of better conditioning (the separation distance of the scaled points increases).
A variant on RBF networks is normalized radial basis function (NRBF) networks, in which we require the sum of the basis functions to be unity. NRBF arises more naturally from a Bayesian statistical perspective. However, there is no evidence that either the NRBF method is consistently superior to the RBF method, or vice versa.
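The basic and normalized forms described above can be sketched in plain Scala as follows. This is an illustration rather than Smile's `RBFNetwork`; `RbfSketch` and its members are hypothetical names, and the Gaussian helper takes a `width` parameter whose parameterization, exp(-r² / (2·width²)), is an assumption made for this example.

```scala
object RbfSketch {
  /** Euclidean distance ||a - b||. */
  def distance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    math.sqrt(a.zip(b).map { case (ai, bi) => (ai - bi) * (ai - bi) }.sum)
  }

  /** A Gaussian radial basis function of the distance r (assumed parameterization). */
  def gaussian(width: Double): Double => Double = {
    require(width > 0.0, "width must be positive")
    r => math.exp(-(r * r) / (2.0 * width * width))
  }

  /** y(x) = Σ w_i φ(||x - c_i||); if normalized, the sum is divided by Σ φ(||x - c_i||)
    * so that the effective basis functions sum to one. */
  def predict(x: Array[Double],
              centers: Array[Array[Double]],
              weights: Array[Double],
              phi: Double => Double,
              normalized: Boolean = false): Double = {
    require(centers.length == weights.length, "one weight per center")
    val activations = centers.map(c => phi(distance(x, c)))
    val weighted = activations.zip(weights).map { case (h, w) => h * w }.sum
    if (normalized) weighted / activations.sum else weighted
  }
}
```

With the centers and φ fixed, the output is linear in the weights, which is why the weights can be estimated by linear least squares on the matrix of activations.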
SVMs with a Gaussian kernel have a structure similar to RBF networks with Gaussian radial basis functions. However, the SVM approach "automatically" solves the network complexity problem, since the size of the hidden layer is obtained as the result of the QP procedure. Hidden neurons and support vectors correspond to each other, so the center selection problem of the RBF network is also solved, as the support vectors serve as the basis function centers. It was reported that, with a similar number of support vectors/centers, SVM shows better generalization performance than the RBF network when the training data size is relatively small. On the other hand, the RBF network gives better generalization performance than SVM on large training data.
References:
- Simon Haykin. Neural Networks: A Comprehensive Foundation (2nd edition). 1999.
- T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE 78(9):1481-1497, 1990.
- Nabil Benoudjit and Michel Verleysen. On the kernel widths in radial-basis function networks. Neural Processing Letters, 2003.
x
training samples.
y
response variable.
neurons
the radial basis functions.
normalized
whether to train a normalized RBF network.
def ridge(formula: Formula, data: DataFrame, lambda: Double): LinearModel
Ridge Regression.
When the predictor variables are highly correlated amongst themselves, the coefficients of the resulting least squares fit may be very imprecise. By allowing a small amount of bias in the estimates, more reasonable coefficients may often be obtained; often, small amounts of bias lead to dramatic reductions in the variance of the estimated model coefficients. Ridge regression is one such technique: it shrinks the regression coefficients by imposing a penalty on their size. Ridge regression was originally developed to overcome the singularity of the X'X matrix. This matrix is perturbed so as to make its determinant appreciably different from 0.
Ridge regression is a kind of Tikhonov regularization, which is the most commonly used method of regularization of ill-posed problems. Another interpretation of ridge regression is available through Bayesian estimation. In this setting, the belief that the weights should be small is encoded in a prior distribution.
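To make the perturbation of X'X concrete, here is a minimal plain-Scala sketch that solves (X'X + λI)β = X'y for two predictors. It is not Smile's solver; `RidgeSketch` is a hypothetical name, and the code assumes exactly two predictor columns that are already centered, with no intercept term.

```scala
object RidgeSketch {
  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  /** Solves the 2x2 system (X'X + lambda * I) beta = X'y by Cramer's rule.
    * Returns (beta1, beta2). Assumes centered columns x1, x2 and no intercept. */
  def fit(x1: Array[Double], x2: Array[Double], y: Array[Double], lambda: Double): (Double, Double) = {
    require(x1.length == x2.length && x1.length == y.length, "columns must have equal length")
    require(lambda >= 0.0, "lambda must be non-negative")
    val a = dot(x1, x1) + lambda   // top-left entry of X'X + lambda * I
    val b = dot(x1, x2)            // off-diagonal entry
    val d = dot(x2, x2) + lambda   // bottom-right entry
    val det = a * d - b * b
    require(det > 0.0, "X'X + lambda * I is singular; use a larger lambda")
    val r1 = dot(x1, y)
    val r2 = dot(x2, y)
    ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)
  }
}
```

When x2 is an exact multiple of x1, the determinant is zero for lambda = 0 and the call is rejected; any positive lambda makes the system solvable and yields coefficients shrunk toward zero.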
formula
a symbolic description of the model to be fitted.
data
the data frame of the explanatory and response variables.
lambda
the shrinkage/regularization parameter.
def svm[T <: AnyRef](x: Array[T], y: Array[Double], kernel: MercerKernel[T], eps: Double, C: Double, tol: Double = 1E-3): KernelMachine[T]
Support vector regression.
Like SVM for classification, the model produced by SVR depends only on a subset of the training data, because the cost function ignores any training data close to the model prediction (within the threshold eps).
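The thresholded cost mentioned above is commonly written as the epsilon-insensitive loss, max(0, |y - f(x)| - eps). The following plain-Scala sketch illustrates it; `SvrLossSketch` and its methods are hypothetical names and do not reflect how Smile trains the model.

```scala
object SvrLossSketch {
  /** Epsilon-insensitive loss: zero when the prediction is within eps of the target,
    * and linear in the excess error otherwise. */
  def epsInsensitive(y: Double, prediction: Double, eps: Double): Double = {
    require(eps >= 0.0, "eps must be non-negative")
    math.max(0.0, math.abs(y - prediction) - eps)
  }

  /** Indices of samples whose loss is nonzero, i.e. that lie strictly outside the eps-tube. */
  def outsideTube(y: Array[Double], predictions: Array[Double], eps: Double): Seq[Int] = {
    require(y.length == predictions.length, "one prediction per target")
    y.indices.filter(i => epsInsensitive(y(i), predictions(i), eps) > 0.0)
  }
}
```

Samples strictly inside the tube contribute nothing to the cost, which is why the fitted model ends up depending only on the remaining samples.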
T
the data type.
x
training data.
y
response variable.
kernel
the kernel function.
eps
the loss function error threshold.
C
the soft margin penalty parameter.
tol
the tolerance of convergence test.
returns
SVR model.
object gpr
Gaussian Process for Regression.
